DOCUMENT RESUME 



ED 230 566                                              TM 830 234

AUTHOR          Dorans, Neil J.; Kulick, Edward
TITLE           Assessing Unexpected Differential Item Performance of
                Female Candidates on SAT and TSWE Forms Administered
                in December 1977: An Application of the
                Standardization Approach.
INSTITUTION     Educational Testing Service, Princeton, N.J.
REPORT NO       ETS-RR-83-9
PUB DATE        Feb 83
NOTE            54p.; Some tables may be marginally legible due to
                small print.
AVAILABLE FROM  Educational Testing Service, Research Publications,
                R116, Princeton, New Jersey, 08541.
PUB TYPE        Reports - Research/Technical (143)
EDRS PRICE      MF01/PC03 Plus Postage.
DESCRIPTORS     Ability; College Entrance Examinations; *Females;
                High Schools; *Item Analysis; Language Tests; Latent
                Trait Theory; Models; Standardized Tests;
                *Statistical Analysis; *Test Bias; Test Items
IDENTIFIERS     *Scholastic Aptitude Test; Standardization; *Test of
                Standard Written English



ABSTRACT 

A new approach to assessing unexpected differential 
item performance (item bias or item fairness) was developed and 
applied to the item responses of males and females to Scholastic 
Aptitude Test and Test of Standard Written English items administered 
operationally in December 1977. While the main body of the report 
describes the particulars of the present application and delineates 
the essential features of the approach, a technical appendix 
describes the standardization approach in detail. The primary goal of 
the standardization approach is to control for differences in 
subpopulation ability before making comparisons between subpopulation 
performance on test items. By so doing, it removes the contaminating 
effects of ability differences from the assessment of item fairness. 
Of the total of 195 items studied, the standardization approach 
identified only a handful as meriting careful review for possible 
content bias. Of these few, only one item exhibited a clearly 
unacceptable degree of unexpected differential item performance 
between males and females that could be attributed to content bias. 
(Author) 



***********************************************************************
*   Reproductions supplied by EDRS are the best that can be made     *
*                    from the original document.                     *
***********************************************************************



RR-83-9

RESEARCH REPORT

U.S. DEPARTMENT OF EDUCATION
NATIONAL INSTITUTE OF EDUCATION
EDUCATIONAL RESOURCES INFORMATION CENTER (ERIC)

This document has been reproduced as received from the person or
organization originating it.

Minor changes have been made to improve reproduction quality.

Points of view or opinions stated in this document do not necessarily
represent official NIE position or policy.


ASSESSING UNEXPECTED DIFFERENTIAL ITEM PERFORMANCE

OF FEMALE CANDIDATES ON SAT AND TSWE FORMS 
ADMINISTERED IN DECEMBER 1977: AN APPLICATION OF 
THE STANDARDIZATION APPROACH 



Neil J. Dorans 
Edward Kulick 



"PERMISSION TO REPRODUCE THIS
MATERIAL HAS BEEN GRANTED BY



TO THE EDUCATIONAL RESOURCES
INFORMATION CENTER (ERIC)."



February 1983




Educational Testing Service
Princeton, New Jersey



ASSESSING UNEXPECTED DIFFERENTIAL ITEM PERFORMANCE 
OF FEMALE CANDIDATES ON SAT AND TSWE FORMS 

ADMINISTERED IN DECEMBER 1977:
AN APPLICATION OF THE STANDARDIZATION APPROACH



Neil J. Dorans
Edward Kulick

Educational Testing Service 



February, 1983 



This approach developed as an outcome of several conversations with Paul H. Holland
about the various problems associated with assessing unexpected differential item
performance. The authors are grateful for his helpful comments. This approach,
which is still evolving, has also been shaped by the comments and suggestions of
Thomas F. Donlon, Gary L. Marco and Nancy S. Petersen. We also thank
Lawrence J. Stricker for his careful review of an earlier draft. Finally,
without the programming and systems development work of Edwin O. Blew and
Karen Carroll, this approach would have remained a system of equations.



Copyright © 1983. Educational Testing Service. All rights reserved.



Abstract 

A new approach to assessing unexpected differential item performance (item
bias or item fairness) is developed and applied to the item responses of males
and females to SAT/TSWE items administered operationally in December 1977.
While the main body of the report describes the particulars of the present
application and delineates the essential features of the approach, a technical
appendix describes the standardization approach in detail. The primary goal of
the standardization approach is to control for differences in subpopulation
ability before making comparisons between subpopulation performance on test
items. By so doing, it removes the contaminating effects of ability differences 
from the assessment of item fairness. Of the total of 195 items studied, the 
standardization approach identified only a handful as meriting careful review 
for possible content bias. Of these few, only one item exhibited a clearly 
unacceptable degree of unexpected differential item performance between males 
and females that could be attributed to content bias. 



ASSESSING UNEXPECTED DIFFERENTIAL ITEM PERFORMANCE 
OF FEMALE CANDIDATES ON SAT AND TSWE FORMS 
ADMINISTERED IN DECEMBER 1977: 
AN APPLICATION OF THE STANDARDIZATION APPROACH 

Those who develop and review the Scholastic Aptitude Test (SAT) are 
aware of the diversity of the test-taking population and attempt to construct 
tests based on a broad sampling of tasks and topics that tend not to favor any 
subgroup of the population. Donlon (1981) discussed the checks that are performed 
on the SAT to guard against favoritism towards any subgroup. In that article, 
Donlon summarized procedures used in the test development process to ensure that 
items or test questions are appropriate for various subgroups as well as the
types of statistical checks performed to evaluate item appropriateness.

Carlton and Marco (1982), in a review of methods used at Educational
Testing Service to detect and eliminate possible favoritism in items, discussed
several studies that have examined performance on SAT items across different
subpopulations. Included in their review were six studies that were conducted
to monitor differential item performance of various groups on several forms of
the SAT and its companion test, the Test of Standard Written English (TSWE).
The purposes of this monitoring are:

(1) to ensure that the SAT and TSWE remain appropriate over time for major
subgroups of the SAT candidate population, and

(2) to identify possible content factors related to differential item
performance that would help test developers construct fair tests.

Dorans (1982) reviewed the five of those six studies that examined Black/White
candidate performance on SAT/TSWE items from forms of the SAT/TSWE that
have the current content and format specifications. In the present report,
the statistical method of standardization is used to examine whether there are
unexpected differences in item performance across different subpopulations of
the Scholastic Aptitude Test test-taking population.

Unexpected Differential Item Performance 

Unexpected differential item performance exists when there are differences 
in item performance that cannot be accounted for by differences in subgroup 
ability. An item is exhibiting unexpected differential item performance when 
the expected performance on the item is lower for examinees from one group 
than for examinees of equal ability from another group or other groups. If 
we let S represent ability as measured by total score¹ on the standard
College Board 200-to-800 SAT scale (or on the 20-to-60 TSWE scale), and X repre- 
sent an item score (1 if the answer to the question is correct and 0 if the 
answer is incorrect), then an item is free of unexpected differential item 
performance when it satisfies the following equality 



(1)   $P_g(X=1 \mid S) = P_{g'}(X=1 \mid S)$   for all subpopulations $g$ and $g'$,

where $P_g(X=1 \mid S)$ is defined as the probability that candidates from
subpopulation $g$ who have total test scores equal to $S$ will answer the item
correctly. For example, if male and female candidates with the same total test
scores do not have equal probabilities of successful performance on an item,
this difference is taken as evidence of unexpected differential item
performance for male and female candidates at that particular score level.
Note that lack of unexpected differential item performance does not imply that
there are no differences in item performance across subgroups of the
Scholastic Aptitude Test candidate population. Unexpected differential item
performance does not refer to differences in overall subgroup performance on
an item but rather to differences in conditional item performance where the
requisite condition before comparison is identical total test score.

¹ It is recognized that use of reported scaled score as the control variable can
be criticized because it is not a perfect measure of ability and because it is
an internal criterion, i.e., performance on an item is related to total score
performance in part because that item went into the determination of total
score. Nonetheless, reported scaled score is probably the best control variable
available for studies of unexpected differential item performance.
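The conditional comparison in equation (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `conditional_p_correct` and the toy score and item vectors are hypothetical and are not taken from the report or from ETS software. The sketch tabulates, for one group, the number of candidates and the proportion answering the item correctly at each scaled score level.

```python
from collections import defaultdict


def conditional_p_correct(scores, item_correct):
    """Return {score_level: (n, proportion_correct)} for one group.

    scores       -- scaled total scores, one per candidate
    item_correct -- item scores X (1 = correct, 0 = incorrect), same order
    """
    counts = defaultdict(lambda: [0, 0])  # level -> [n candidates, n correct]
    for s, x in zip(scores, item_correct):
        counts[s][0] += 1
        counts[s][1] += x
    return {s: (n, right / n) for s, (n, right) in sorted(counts.items())}


# Hypothetical toy data (not from the report).
male_scores = [400, 400, 400, 400, 500, 500]
male_items = [1, 1, 0, 0, 1, 1]
female_scores = [400, 400, 500, 500, 500, 500]
female_items = [1, 0, 1, 1, 1, 0]

p_m = conditional_p_correct(male_scores, male_items)
p_f = conditional_p_correct(female_scores, female_items)

# Compare the two groups only at score levels where both have candidates.
for s in sorted(set(p_m) & set(p_f)):
    print(s, "female:", p_f[s][1], "male:", p_m[s][1])
```

In this toy example the two groups agree at 400 and differ at 500, so only the 500 level would show unexpected differential item performance in the sense of equation (1).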

Several methods have been suggested for identifying unexpected differential 
item performance, or item bias as it is frequently referred to in the literature.
The handbook by Berk (1982) attests to this fact. For a single comprehensive
review of the more popular methods, including the transformed item difficulty or
delta-plot method, item response theory methods and chi-square approaches, see
Shepard, Camilli and Averill (1981). Most of these methods, however, have
exhibited undesirable sensitivities to differences in overall subpopulation
ability or differences in item quality (discrimination). Two of these methods
(transformed item difficulty and a chi-square approach) were employed in earlier
studies of the Scholastic Aptitude Test that were reviewed by Dorans (1982).
Both methods are subject to misclassifying items as unfair towards a particular
subgroup because of methodological sensitivities to differences in subpopulation
ability. The methodology employed in the current study controls for differences
in subpopulation ability through the statistical method of standardization.

Standardization is a technical term that, unfortunately, has more than one
meaning. In one usage, standardization typically refers to a numerical operation
which transforms a set of numbers with a particular mean (average score)
and standard deviation (spread of scores about the average score) to a set of
numbers that has a certain "standard" mean and standard deviation. This is not
the meaning of standardization as used in this report.

Rather, we shall use standardization to mean that one variable is stand- 
ardized with respect to some other variable before making comparisons between 
groups. This type of standardization enables one to control for differences in 
subpopulation ability while making comparisons of the performance of these 
subpopulations on items. The procedures used in this study require a very large
data base in order to ensure the stability of the conditional probabilities
obtained at each score level in each subpopulation under investigation.
Fortunately, there are large data bases available for the Scholastic Aptitude Test.
Other methods of standardization may be used with smaller sample sizes, e.g.,
Alderman and Holland (1981). A general approach to assessing unexpected
differences in item performance via standardization is described in detail in
the appendix, where a mathematical formulation is presented and the method's
similarities to and differences from the item response theory approach are
discussed.






Standardization 

In this section, the essential features of standardization are described.
The conditional probability of successful performance on an item, $P_g(X=1 \mid S)$,
is the raw datum for the standardization method. For each score level S, there
is a conditional probability of successful performance. Studies of unexpected
differential item performance focus on differences in conditional probability of
successful item performance between a study group and a base group. In this
first study, female SAT candidates are the study group, while male candidates
are the base group.

Figure 1 contains plots of the conditional probability of successful
performance for both males and females on an analogy item appearing on Form
ZSA5. Male conditional percent corrects are denoted by squares (□) at each
score level, while female conditional percent corrects are denoted by asterisks 
(*). (Note that there are no asterisks at scaled scores of 770 and 800, which 
indicates there were no females at those two scaled score levels.) In this 
particular figure, the asterisks and squares tend to lie on top of one another. 
This consistent and high degree of overlap is evident in Figure 2, which is a 
plot of differences in conditional probabilities for this item. Note that 
almost all the asterisks in Figure 2 lie very close to the line of zero differ- 
ence. This particular analogy item exhibits very little unexpected differential 
item performance. 

The analogy item portrayed in Figures 3 and 4 serves as a striking con- 
trast to that depicted in Figures 1 and 2. Here, the squares (males) are higher 
than the asterisks (females) at almost every scaled score level. In fact, 
between scaled scores of 250 and 500, the difference between the female condi- 
tional probabilities and the male conditional probabilities tends to be .2, 
i.e., the probability that a male with a given scaled score in that range will 
answer that analogy item correctly exceeds the probability that a female with
the same exact scaled score will answer the item correctly by the substantial 
amount of .2. Clearly, this particular item exhibits a substantial amount of 
unexpected differential item performance. 

[Figures 1 and 3. Conditional Probability of Successful Item Performance for
Both Males and Females on Two Verbal Items from SAT Form ZSA5. Figure 1:
Item 39. Figure 3: Item 63. Horizontal axis: scaled score. Plotted symbols:
asterisks for females, squares for males.]

[Figures 2 and 4. Difference Plots of Two Verbal Items from SAT Form ZSA5.
Figure 2: Item 39. Figure 4: Item 63. Horizontal axis: scaled score. Plotted
quantity: females minus males. Each panel prints the item's DIFF and RMWSD
values. Figure 2 prints RMWSD = 0.0253 and a DIFF between 0.015 and 0.016; the
values printed on Figure 4 are only partly legible (DIFF near -0.1, RMWSD
near 0.17).]

Examination of conditional probability plots such as those depicted in
Figures 1 and 3 and difference plots like those in Figures 2 and 4 enables
one to look for evidence of unexpected differential item performance at fixed
score levels. In effect, the plots allow one to control for ability before
comparing item performance across subpopulations. Consequently, for each item
there is potential for unexpected differential item performance that can be
summarized via some numerical index. One such index is the difference in
conditional probabilities of successful performance at a given score level. If
there are 61 observed score levels, as there are on the College Board SAT
scale that ranges from 200 to 800 in steps of 10, then there are 61 such
differences for each item. Clearly there exists a need for an economical summary of
these differences. Standardization provides that summary.
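The per-level differences described above can be sketched as follows. The function name `conditional_differences` and the toy probabilities are hypothetical illustrations, not values from the report. The sketch assumes that a score level contributes a difference only when both groups have candidates at that level, which mirrors the remark that no asterisks are plotted where there were no females.

```python
# The 61 reported score levels on the 200-to-800 scale, in steps of 10.
SCORE_LEVELS = range(200, 801, 10)


def conditional_differences(p_study, p_base):
    """Map score level -> P_study - P_base at levels observed in both groups.

    p_study, p_base -- {score_level: conditional proportion correct}
    """
    return {
        s: p_study[s] - p_base[s]
        for s in SCORE_LEVELS
        if s in p_study and s in p_base
    }


# Hypothetical conditional proportions correct (not from the report).
p_female = {300: 0.20, 400: 0.45, 500: 0.70}
p_male = {300: 0.25, 400: 0.45, 500: 0.65, 800: 1.00}

diffs = conditional_differences(p_female, p_male)
for level, d in diffs.items():
    print(level, round(d, 3))
```

With real data there can be up to 61 such differences per item, which is why a single weighted summary is attractive.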

The application of the standardization procedure, in which the marginal
ability distribution of the female standardization group serves as a weighting
function, yields several summary indices of item performance. First, there is
the observed percent correct for the female study group, $P_f$, obtained by taking
a weighted sum of the 61 conditional probabilities of successful performance
observed in the female study group, where the relative frequencies at each
of the 61 scaled score levels in the female study group serve as the weights.
These same weights are applied to the 61 conditional probabilities observed in
the male base group to produce an index of expected item performance for the
female study group, $\hat{P}_f$. The difference between $P_f$ and $\hat{P}_f$,
$D_f = P_f - \hat{P}_f$, is one index of unexpected differential item
performance. If there is no unexpected differential item performance, $D_f$
should equal zero. A positive $D_f$ indicates that the study group exceeds its
expected performance, while a negative $D_f$ indicates that the item is harder
than expected for the study group. Since $D_f$ is a signed index, it is
insensitive to crossovers in the conditional success distributions of the base
and study groups. An unsigned discrepancy index that can be used with $D_f$ is
the root mean weighted squared difference ($\mathrm{RMWSD}_f$). The
$\mathrm{RMWSD}_f$ for an item is obtained by weighting each difference in
conditional probabilities of successful item performance between the study and
base groups by that difference (which is equivalent to squaring the difference)
and by the frequency of scores in the female standardization group at each scaled
score level, summing these weighted differences across the 61 scaled score levels,
dividing this sum by the number of candidates in the standardization group, and
taking the square root of the result. The mathematical formula for the $\mathrm{RMWSD}_f$ is

(2)   $\mathrm{RMWSD}_f = \left( \sum_{s=1}^{S} N_{fs+} (P_{fs} - \hat{P}_{fs})^2 \Big/ \sum_{s=1}^{S} N_{fs+} \right)^{1/2}$

where $S$ is the number of score levels, $N_{fs+}$ is the number of individuals at score
level $s$ in subpopulation $f$, $P_{fs}$ is the conditional probability of successful
performance in subpopulation $f$ at score level $s$, and $\hat{P}_{fs}$ is the predicted value
of $P_{fs}$. Note that typically $\hat{P}_{fs} = P_{ms}$, where $P_{ms}$ is the conditional probability
of successful performance observed at score level $s$ in the male base group.
Given the definition of $D_f$ as

(3)   $D_f = \sum_{s=1}^{S} N_{fs+} (P_{fs} - \hat{P}_{fs}) \Big/ \sum_{s=1}^{S} N_{fs+}$

it can be shown that

(4)   $\mathrm{RMWSD}_f = \left( D_f^2 + \sum_{s=1}^{S} N_{fs+} (D_{fs} - D_f)^2 \Big/ \sum_{s=1}^{S} N_{fs+} \right)^{1/2}$

where $D_{fs} = P_{fs} - \hat{P}_{fs}$. Since this index is unsigned, any difference produces a
positive discrepancy. Consequently, every item will have a positive $\mathrm{RMWSD}_f$.
An item exhibiting substantial unexpected differential item performance will
have a large $\mathrm{RMWSD}_f$.
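As a minimal sketch of equations (2) and (3), the four indices can be computed from the study-group frequencies and the two sets of conditional probabilities. The function name `standardization_indices`, the dictionary layout, and the toy numbers are assumptions made for illustration; the sketch also assumes the base group has candidates at every score level where the study group does, and it takes the base-group conditional probability as the predicted value.

```python
import math


def standardization_indices(n_study, p_study, p_base):
    """Return (P_f, P_hat_f, D_f, RMWSD_f) for one item.

    n_study -- {score_level: number of study-group candidates at that level}
    p_study -- {score_level: conditional proportion correct, study group}
    p_base  -- {score_level: conditional proportion correct, base group}
    """
    levels = [s for s, n in n_study.items() if n > 0]
    total = sum(n_study[s] for s in levels)
    # Observed proportion correct in the study group (weights = study counts).
    p_obs = sum(n_study[s] * p_study[s] for s in levels) / total
    # Expected proportion correct: base-group probabilities, same weights.
    p_exp = sum(n_study[s] * p_base[s] for s in levels) / total
    d = p_obs - p_exp
    mean_sq = sum(
        n_study[s] * (p_study[s] - p_base[s]) ** 2 for s in levels
    ) / total
    return p_obs, p_exp, d, math.sqrt(mean_sq)


# Hypothetical item with a crossover: females lower at 300, higher at 500.
n_f = {300: 100, 400: 200, 500: 100}
p_f = {300: 0.20, 400: 0.50, 500: 0.80}
p_m = {300: 0.30, 400: 0.50, 500: 0.70}

p_obs, p_exp, d, rmwsd = standardization_indices(n_f, p_f, p_m)
print(round(p_obs, 4), round(p_exp, 4), round(d, 4), round(rmwsd, 4))
```

In this toy case the signed index is zero because the two differences cancel, while the unsigned index is about .07, which illustrates why the two indices are reported together.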

Equation (4) expresses $\mathrm{RMWSD}_f$ as the square root of the sum of two additive
components: the square of a constant directional discrepancy, which is $D_f^2$,
and an index of residual crossover, i.e., a sum of weighted squared differences
in conditional probabilities after adjusting for the constant difference, which
is the second component in (4). While the $D_f^2$ portion is probably systematic
and indicative of unexpected differential item performance, the residual
crossover component may or may not be indicative of systematic unexpected
differential item performance because it does not allow random differences to
cancel out. As such, the primary purpose of the residual crossover component is
to flag an item for closer examination.
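The decomposition in equation (4) can be checked numerically. The sketch below is illustrative only: `decompose_rmwsd` and both toy difference profiles are hypothetical. It splits the mean weighted squared difference into the squared directional discrepancy and the residual crossover term, then rebuilds the unsigned index from the two parts.

```python
import math


def decompose_rmwsd(n_study, diffs):
    """Return (D_f, residual_crossover, RMWSD_f).

    n_study -- {score_level: study-group count}
    diffs   -- {score_level: P_fs - P_hat_fs}
    """
    total = sum(n_study.values())
    d = sum(n_study[s] * diffs[s] for s in n_study) / total
    # Weighted squared differences after removing the constant difference D_f.
    residual = sum(n_study[s] * (diffs[s] - d) ** 2 for s in n_study) / total
    return d, residual, math.sqrt(d ** 2 + residual)


n_f = {300: 100, 400: 200, 500: 100}

# Hypothetical profile 1: the same gap at every level (purely directional).
constant_gap = {300: -0.10, 400: -0.10, 500: -0.10}
# Hypothetical profile 2: gaps that cancel (pure crossover).
crossover = {300: -0.10, 400: 0.00, 500: 0.10}

d1, res1, rmwsd1 = decompose_rmwsd(n_f, constant_gap)
d2, res2, rmwsd2 = decompose_rmwsd(n_f, crossover)
print(round(d1, 4), round(res1, 4), round(rmwsd1, 4))
print(round(d2, 4), round(res2, 4), round(rmwsd2, 4))
```

For the constant-gap profile the whole index comes from the directional term; for the crossover profile it comes entirely from the residual term, which is the case the text singles out for closer examination.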

A problem faced by any investigation which seeks to detect and quantify 
unexpected differential item performance, regardless of methodology, is the 
determination of what level of unexpected differential item performance should 
evoke concern. One could argue that any difference should evoke concern. This,
however, would be an extreme position that ignores the fact that measurement
systems are always contaminated by noise. In the present study, we examined
distributions of root mean weighted squared differences ($\mathrm{RMWSD}_f$) to empirically
determine a cutoff point which defines a substantial amount of unexpected
differential item performance. Examination of these frequency distributions led
us to conclude that an item with a $\mathrm{RMWSD}_f$ greater than or equal to .08 merits
careful investigation, while an item with a $\mathrm{RMWSD}_f$ less than .08 does not
require additional study. As we acquire more experience with applying the
standardization approach to other data bases, a better cutoff may evolve.

In combination, $D_f$ and $\mathrm{RMWSD}_f$ provide a statistical description of an item
that will enable us to ascertain the degree of unexpected differential item
performance obtained in the female study group.
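The .08 screening rule described above amounts to a simple filter. In the sketch below, the function name `flag_items`, the item labels, and the index values are hypothetical; only the cutoff itself comes from the text.

```python
CUTOFF = 0.08  # empirical cutoff stated in the text


def flag_items(rmwsd_by_item, cutoff=CUTOFF):
    """Return the items whose RMWSD is greater than or equal to the cutoff."""
    return sorted(item for item, v in rmwsd_by_item.items() if v >= cutoff)


# Hypothetical RMWSD values for three items (labels are illustrative).
rmwsd_by_item = {"item_A": 0.03, "item_B": 0.17, "item_C": 0.08}

flagged = flag_items(rmwsd_by_item)
print(flagged)
```

A flagged item is a candidate for content review, not a verdict; the signed index would then be read alongside the unsigned one to judge direction.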

Test Form and Sample Used in This Study 

SAT Form ZSA5 and TSWE Form E11, administered in December 1977, were used
in this study. Stern (1977) previously described the psychometric properties of
TSWE Form E11; and Cook and Nutkowitz (1979), the psychometric properties of SAT
Form ZSA5. Since the psychometric properties of ZSA5/E11 are described in
detail in the test analysis reports just cited, only the most salient
characteristics are summarized here. Both the verbal and mathematical sections of Form
ZSA5 had fairly typical reliabilities (and scaled score standard errors of
measurement) of .914 (32) and .916 (33), respectively, in a spaced sample of
1,895 candidates from the total group of 166,311 candidates who took Form ZSA5
in December, 1977. The mean equated delta, an index of test difficulty described
by Hecht and Swineford (1981) and Walker (1981), for the verbal section was
11.3, which indicated the test was slightly easier than intended. For the
mathematical section, the mean equated delta was 12.4, slightly more difficult
than intended. TSWE Form E11 had a fairly typical reliability of .887 in a
spaced sample of 1,615 candidates from the total group of 84,144 who took TSWE
Form E11 in June 1976. The mean equated delta was 9.3, slightly easier than
intended.
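The relation between a reliability and a standard error of measurement can be illustrated with the classical test theory formula SEM = SD x sqrt(1 - reliability). The report does not state how its values were computed or give the spaced-sample standard deviations, so the standard deviations of 108 and 115 used below are assumptions chosen only to show the arithmetic.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory SEM: sd * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


# Reliabilities are from the text; the SDs are assumed for illustration.
verbal_sem = standard_error_of_measurement(sd=108.0, reliability=0.914)
math_sem = standard_error_of_measurement(sd=115.0, reliability=0.916)
print(round(verbal_sem), round(math_sem))
```

Under these assumed standard deviations the formula gives values in the low thirties, the same order as the 32 and 33 scaled score points quoted above.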


The basic data for this study were the item responses of 21,835 male
candidates and 21,209 female candidates who took the 85 verbal, 60 mathematical
and 50 TSWE items that appeared in the operational sections of Forms ZSA5
and E11 that were administered in December, 1977. The combined sample of 43,044
was representative of the total group that took ZSA5/E11 at that administration.

Procedure

The focus of the present study is on the assessment of unexpected differential
item performance for female candidates on Forms ZSA5 and E11 items. In
this particular application of the general standardization technique, the study
group is the female candidate subpopulation. The standardization group supplies
the standard ability distribution used by the standardization approach. Any
subgroup, including a composite group or a hypothetical group, can be used as the
standardization group. Since the standard ability distribution serves as a
weighting function, it is advisable to use each study group as its own
standardization group, thereby enabling use of a weighting function that mirrors the
relative frequency at each score level in the study group. The male candidate
subpopulation, as the majority group, was chosen as the base group, i.e., the
subpopulation that supplies the model for item performance as a function of
ability. The model is the conditional probability of successful performance on
the item given ability. The largest subpopulation was used as the base group in
order to produce the most statistically stable model of item performance given
test score that can be attained. Table 1 contains the marginal score
distribution for the female study group and male base group for SAT-Verbal,
SAT-Mathematical, and TSWE. Note that the largest weights (relative frequency in


Table 1 



Frequency Distributions and Summary Statistics of Males' and Females' Verbal, Mathematical, and TSWE
Scaled Scores



VERBAL 



MATHEMATICAL 



TSWE 



Scaled 
Score 

aoo 

7du 
/70 
7feD 
rbC 

/4fC 

7iO 
720 
710 

^90 
fc7s/ 

eeu 

(40 
t30 
620 
610 

f lC 
59 c 

571 
0 

tbO 

540 

52 C 

520 

51u 

50C 

490 

4i}3 

470 

46 J 

4^>u 

440 

42'* 

4^0 

410 

4 CO 

3<iG 

?bO 

3(0 
3'.0 
34C 
33J 
32C 
3U 
3wU 
290 
2mO 
270 
26C 
25i> 
24C 
230 
220 
210 
2yO 



Mean

S.D. 



Male 



Female 



Male 



Female 



£ 


X belov 


f 


X belov 


1 


100.0 


c 


ICO.O 


1 


lOC.O 


I 


100.0 


I 


10C*.0 


1 


100. 0 


2 


100.0 


0 


100. 0 




ICrO. 3 


7 


100.0 


10 


99.9 


8 


99.9 


14 


99.8 


25 


99.8 




99. 8 


14 


99.7 


12 


99.7 


9 


99.7 


49 


99.5 


33 


99.5 


1 1 


99.4 


21 


99.4 


6 1 


99.2 


46 


99.2 


49 


98.9 


32 


99.1 




9 8.5 


75 


96.7 


74 


98.2 


49 


98.5 


66 


97.9 


57 


98.2 


165 


97.1 


150 


97.5 


91 


96.7 


47 


97.3 


224 


95.7 


191 


96.4 


1 26 


95.1 


120 


95.8 


25C 


9^.0 


2 17 


94.8 


I 5? 


93.3 


122 


94.2 


36d 


91.6 


317 


92,7 


?^*1 


90.7 


196 


91.8 


192 


89.8 


152 


91.1 


476 


67.6 


434 


89.0 


2c9 


86.4 


251 


87.8 


536 


83.9 


524 


85.4 


341 


82.3 


318 


83.9 


672 


79.3 


662 


80.8 


379 


77.5 


362 


79.1 


737 


74.2 


7 29 


75.6 


4(7 


72.0 


428 


73.6 


48? 


69.6 


442 


71.5 


967 


65.4 


895 


67,3 


505 


63.1 


487 


65.0 


1062 


58.1 


1055 


60.0 


5f9 


55.5 


543 


5*. 5 


iOC'9 


5C.9 


949 


53.0 


507 


4 6.2 


5 70 


5C.3 


545 


45.7 


611 


47.4 


1016 


41.0 


972 


42.8 


553 


?8.5 


567 


40.2 


1128 


33.3 


1021 


35.3 


554 


3 0. 8 


511 


5 c • f 


H60 


26.9 


813 


29.1 


477 


24.7 


475 


26.9 


90H 


20.5 


891 


22.7 


358 


18.9 


290 


21.3 


444 


16.8 


358 


19.6 


703 


13.6 


775 


16.0 


3«7 


11.9 


334 


14.4 


567 


9.3 


545 


11.8 


261 


8.1 


296 


10.4 


504 


5.8 


582 


7.7 


150 


5.1 


205 


6.7 


360 


3.4 


449 


4.6 


162 


2.7 


188 


3.7 


12B 


2.1 


183 


2.8 


1B2 


1.3 


220 


1.8 


275 


;o.o 


382 


0.0 


L,835 




21,209 





f X 


belov 


f 


X belov 


13 


99.9 


0 


100.0 


17 


99.9 


2 


100. 0 


17 


99.8 


5 


inc.o 


30 


99.6 


3 


100.0 


48 


99.4 


• 6 


99.9 


6w 


99.2 


15 


99.9 


92 


98.7 


15 


99.8 


151 


98.0 


32 


99.6 


120 


97.5 


24 


99.5 


134 


96.9 


38 


99.3 


154 


96.2 


40 


99.2 


148 


95.5 


44 


Q8.9 


195 


94.6 


70 


98.6 


197 


93.7 


66 


98.3 


256 


92.5 


71 


98.0 


219 


91.5 


82 


97.6 


234 


90.5 


89 


97.2 


29 3 


89.1 


122 


96.6 


205 


87.7 


134 


96.0 


345 


86.1 


146 


95.3 


655 


83. 1 


39C 


93.4 


429 


81.2 


221 


92.4 


397 


79.3 


246 


91.2 


435 


77.4 


267 


90.0 


376 


75.6 


265 


88.7 


513 


73.3 


324 


87.2 


1080 


68.3 


692 


63.9 


500 


66.0 


346 


82.3 


497 


63.8 


391 


80.5 


568 


61.2 


4 84 


78.2 


601 


58.4 


468 


75.9 


599 


55.7 


496 


73.5 


1109 


50.6 


1314 


68.7 


619 


47.8 


542 


66.2 


625 


44.9 


567 


63.5 


562 


42.3 


539 


61.0 


514 


39.9 


545 


5 8.4 


1096 


34.9 


1292 


52.3 


541 


32.4 


619 


49*4 


460 


30.3 


533 


46.9 


471 


28.2 


562 


44.2 


483 


26.0 


6 57 


41.1 


494 


23.7 


603 


38.3 


842 


19.8 


1035 


33.4 


449 


17.8 


536 


30.9 


446 


15.7 


615 


28.0 


419 


13.8 


611 


25.1 


319 


12.4 


480 


22.8 


657 


9.3 


1070 


17.8 


247 


7.8 


606 


14.9 


271 


6.5 


527 


12.5 


232 


5.5 


410 


10.5 


249 


4.3 


473 


8.3 


256 


3.1 


457 


6.1 


336 


1.6 


619 


3.2 


112 


1.1 


203 


2.3 


97 


0.6 


182 


1.4 


69 


0.3 


136 


0.8 


34 


0.2 


67 


0.4 


29 


0.0 


71 


0.1 


9 


0.0 


24 


0.0 



Mele 

f X belov 

0 100.0 

c no.o 

ivv..: 

0 100.0 

C liC.f" 

U 10-J.C 

C 100.0 

0 ICO. 3 

0 100.0 

C 100.0 

C 100.0 

r io:.c 

C IJC.O 

0 K'O.O 

C 1C0«0 

0 lOC.O 

0 ioc.r> 

IjO.O 

0 100.0 

642 97.1 

375 95.3 

Q 95.3 

477 93.2 

565 90.6 

6H 90.3 

626 87.4 

706 84.2 

759 80.7 

732 77.3 

144 76.7 

767 73.2 

808 69.5 

d96 65.4 

853 61.4 

239 60.4 

764 56,9 

834 53.1 

737 49.7 

777 46.1 

296 44.8 

743 41.4 

7M 37*9 

725 34.6 

C 34.6 

711 31.3 

311 29.9 

617 27.1 

617 24.3 

579 21.6 

269 17.8 
15.8 

445 13.8 

418 II. fl 

360 10.2 

1«7 ^-3 

310 7**' 
263 

270 5.4 
1175 C.i? 



Femele 

f X belov 



0 

0 
0 
0 
0 
0 

c 
c 

n 
0 

6 
c 

0 

720 
393 
0 

468 
599 
68 
678 
725 
710 
864 
151 
802 
844 
925 
883 
221 
765 
604 
8C2 
822 
252 
738 
731 
665 
0 

646 
265 
579 
574 
552 
463 
214 
39w» 
3<>3 
365 
327 
157 
237 
728 
226 
963 



100.0 
lOC.O 
100.0 
} OC .0 
ICO.O 

i:o.o 

100.0 
100. 0 
100.0 
ICO.O 
100.0 
100.0 
100.0 
100.0 
100.0 
100.0 
100.0 
ICO.O 
100.0 
100.0 
96.6 
94.8 
94.8 
92.5 
89.7 
89.4 
86.2 
82.8 
79.4 
75.4 
74.7 
70.9 
66.9 
62.5 
58.4 
57.3 
53.7 
49.9 
46.1 
42.3 
41.1 
37.6 
34.2 
31.0 
31.0 
26.0 
26.7 
24.0 
21.3 
16.7 
16.5 
15.5 
13.7 
11.8 
10.1 
8.5 
7.8 
6.7 
5.6 
4.5 
0.0 



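Summary statistics as printed, by section and group:

                        N         Mean     S.D.
Verbal        Male      21,835    415.1    107.4
Verbal        Female    21,209    407.9    108.1
Mathematical  Male      21,835    472.9    118.8
Mathematical  Female    21,209    420.1    106.9
TSWE          Male      21,835    405.1    111.2
TSWE          Female    21,209    414.1    109.2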

the female study group) tend to be given to scores between 240 and 550 on the
verbal scale, scores between 260 and 540 on the mathematical scale, and scores
between 30 and 60 on the TSWE scale (a relatively large weight is also assigned
to 20).
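The weights discussed here are relative frequencies, and the "% below" columns of Table 1 are cumulative percentages of the same frequencies. The sketch below shows both computations. The helper names are hypothetical; the three frequencies (275, 182, 128) are the male verbal counts at the three lowest score levels as they appear to read in the scanned table, and the second example uses made-up counts.

```python
def percent_below(freqs, total):
    """Return {score_level: percent of the group scoring below that level}."""
    below = 0
    result = {}
    for level in sorted(freqs):
        result[level] = round(100.0 * below / total, 1)
        below += freqs[level]
    return result


def relative_frequency_weights(freqs):
    """Return {score_level: share of the group at that level}."""
    total = sum(freqs.values())
    return {level: f / total for level, f in freqs.items()}


# Lowest three verbal levels for males, as read from the scan of Table 1.
male_verbal_low = {200: 275, 210: 182, 220: 128}
pb = percent_below(male_verbal_low, total=21835)
print(pb)

# Hypothetical study-group counts, to show how weights are formed.
weights = relative_frequency_weights({300: 100, 400: 300, 500: 100})
print(weights)
```

The cumulative percentages for the three lowest levels agree with the values printed beside those frequencies in Table 1, which is a useful check when reading a hard-to-read copy of the table.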

Results 

SAT Verbal Results 

Table 2 contains listings of four indices described earlier, $P_f$, $\hat{P}_f$, $D_f$,
and $\mathrm{RMWSD}_f$, and the observed percent correct in the male base group, $P_m$, for the
85 verbal items of Form ZSA5. In addition, it includes the means and standard
deviations of these five indices displayed by item type.

The first row of the summary portion of Table 2 contains statistics based
on all 85 verbal items. Note that mean $P_f$ and mean $\hat{P}_f$ are equal to two decimals.
The difference between mean $D_f$ (.00) and mean $\mathrm{RMWSD}_f$ (.05) is attributed to the
fact that $\mathrm{RMWSD}_f$, unlike $D_f$, is an unsigned index of discrepancy that weights
and sums any squared differences between $P_{fs}$ and $\hat{P}_{fs}$ regardless of which value is
larger and thus prevents cancellation of positive and negative differences. On
the other hand, the signed index $D_f$ expresses the amount by which total
differences in one direction exceed total differences in the other direction.

The next row in Table 2 displays the means and standard deviations of
the five indices computed on the vocabulary items only. Again, mean $P_f$ and mean
$\hat{P}_f$ are nearly equal. Both discrepancy indices are also small. The vocabulary
items can be divided still further into antonym items and analogies items. Mean 
percent correct on these item types are even less related to scaled scores 
than previous item groupings, and so differences in mean percent correct are 






Table 2 



Listing of Item Difficulty and Discrepancy Indices and
Summary Statistics for Verbal Items from SAT Form ZSA5









ITEM #    TYPE    % CORRECT    EST % CORRECT


1 


ANTCNVN 


0.90C5 


0.8607 


t 


AMJMVN 


1.7143 


^.6864 


} 




0,(I098 


0.7932 


i 




0.7062 


0.C928 


5 




0.7466 


0.7C66 


* 


ANICSVN 


U.C899 


0.74C1 


T 




C.5125 


0.5658 


• 




0.4818 


0.4787 






C.5796 


0.5156 




A .irNVN 


C.?223 


0.1910 


11 




0.3449 


0.3452 


12 




0.2924 


C.2664 


13 


ANIJ^VH 


0.3C09 


0,243« 


14 


AmICSVH 


O.Cr86 


C.0801 


15 


AMONVH 


0.1383 


r. 1256 


le 


StNT 


ccx 


0.7825 


0.8373 


17 






C.e9t9 


0.7435 


IS 




CCN 


C.6951 


0.6774 


1^ 


StNl 


CCH 


3.6914 


0.7wlO 


2w 


St NT 


CGH 


I..4C32 


0.4»955 


21 


^cAJ 


CCH 


r.5361 


0.5158 


22 


f £AU 


CCN 


0.^918 


0.6866 


23 


PtAO 


CC«1 


0.C321 


0.6423 


2^ 


liFAO 


CCH 


0.5616 


C • 5 70 1 


25 


f f A '> 


CLS 


0.2607 


0.2360 


2t 


»<t AO 


CfH 


0.C917 


C. 1075 


21 


f LAD 


CCN 


J.2331 


(*.2399 


28 




CCJi 


C.1142 


f>. 1422 


29 




CuH 


C.13C7 


0. 1910 


30 


P i V) 


rcM 


0.1^68 


0. 1 395 


31 


SEM 


rcH 


C •t;455 


0. 8345 


32 


StM 


CCH 


• f .482? 


0 .5C69 


33 


St M 


rcH 


C.4469 


V. S Jmm 


34 


Str^T 


CCN 


C.3343 


C . 9c79 


35 


SEM 


CCN 


0.13C8 


0. 1 1 86 


36 


AN'AlCay 


0.7601 


w . 1 9 9^ 


37 


ANALCGV 




0. 7005 


3d 


A'^Af CGV 


0.6261 


0.6401 


3« 


ASAICGV 


0.5276 


0.5117 


4C 


Af.ALCGV 


^•4669 


0.4515 


41 


ANAiUGV 


0,'^405 


0.3647 


42 


ANALCGV 


0.2447 


0.2112 


43 


ANAICGV 


0.1367 


0.1656 


44 


ANAtCGV 


0.1496 


0.1979 


45 


ANAiCGV 


0.(/67l 


0.0884 


46 


AMONVN 


0.8831 


0.8607 


47 


AN1LNVN 


C.8085 


0.7684 


4fl 


AU1CNVN 


0.7160 


C.8034 


49 


A*ilCNVN 


C.7C3 7 


C.f 639 


5C 


ANTfNVN 


0.4443 


C.4015 


51 


ANTCNVN 


C.475C 


0.4631 


52 


AN7CNVN 


0.4575 


n,3872 


93 


A*iTCKV(4 


C.3569 


0.3755 


54 


ANiONVn 


0.1283 


0.1732 


55 


AilCNyW 


C;lC79 


0.1038 


56 


SENI 


CCN 


0.7701 


0.8155 


57 


S€N1 


CON 


0.6744 


0.6244 


5t 




CON 


0.7046 


0.7139 


59 


SE^^I 


CGN 


0.4493 


0.4439 


eo 




CCiN 


0.2318 


6.2051 



OUF f CCPf'ECT 


8nso 


X CORRECT BASE GROUP 


0.03«8 


0.0908 


0.8702 


0.0279 


C.0406 


0.7045 


C.0U4 


C.C442 


0.8030 


0.0134 


G.0293 


0.71C4 


0.0600 


C.0693 


0.7223 . 


-0.0511 


C.C619 


0.7531 


-C.0533 


0.0661 


0.9777 


0.0031 


e.0273 


0.4926 


0.0640 


C.C697 


0. 5276 


C.0313 


0.C547 


0.1944 


-C.0002 


0.0339 


C.3531 


0.0261 


0.C401 


0.2 745 


0.0570 


0.C794 


0.2517 


-C.0015 


0.0435 


0.0801 


0.0127 


0.C524 


C.1268 


-0.0547 


0.0716 


C.8523 


-0.0466 


0.0603 


0.7572 


C.0176 


O.C309 


0.6919 


-0.0094 


0.0307 


0. 7147 


-0.0923 


C.1054 


0.5047 


0.0203 


C.0391 


0.5287 


0.0052 


0.C362 


0.6993 


-C.0102 


0.0330 


0.6943 


-C.0085 


0.0316 


C.58C1 


0.0248 


0.0471 


0.2420 


-0.0158 


C.C381 


0.1111 


-0.0097 


C.0428 


0.2482 


-C.0280. 


0.0503 


0. 1450 


-0.0603 


C.0760 


0. 1974 


C.C173 


0.C389 


1430 


0.0111 


W.0403 


C. 845T 


-C.0246 


0.C409 


C. 51 tt2 


0.0681 


r .c788 


0. 3899 


0.0104 


0.C369 


U. 3359 


0.0122 


U. CZ '9 


0. 1 255 


-0.0752 


0.C843 


0.8443 


0.9429 


0.C542 


0.7149 


-0.0140 


0.0293 


0.6541 


C.C159 


C.C253 


C.5274 


0.0194 


0.C343 


0.4643 


-0.0242 


0.0381 


0.3756 


0.0334 


0.0466 


0.2193 


-0.C289 


0.0480 


0.1733 


-0.0483 


0.0691 


0.2C47 


-0.0212 


0.r)34 


0.C912 


0.0229 


0.0327 


0.8724 


0.0401 


0.0520 


0.7823 


-0.C874 


C.C979 


0.8173 


0.0398 


0.0489 


0.6773 


0.0428 


C.0561 


0.4153 


0.0119 


0.0393 


0.4760 


0.C703 


0.0832 


C.3980 


-0.01S6 


C.032T 


0.3866 


-0.0449 


0.0714 


C.1781 


0.C041 


0.0310 


0.1075 


-0.0494 


C.060* 


0.8299 


0.0900 


e.0661 


0.6392 


-0.0094 


0.0260 


C.73C3 


C.0493 


0.0569 


0.4984 


0.0267 


0.0424 


0.2125 






Table 2 (continued) 



ITEM #  ITEM TYPE  % CORRECT    EST % CORRECT  DIFF % CORRECT  RMWSD        % CORRECT BASE GROUP



41 




, 0.4340 


62 


a:«aiosy 


0.4I4I 


«3 


ANALOGY 


€•4249 


44 


ANAICGY 


0^e074 


4S 


ANALCGY 


0^4e42 


€4 


AUAICGY 


0.1344 


47 


ANAl OGY 


0.2206 


€• 


ANAICSY 


C^1474 


49 


Af.Al OGY 


^.U44 


70 


ANALOGY 


C^I493 


71 


PtkO CCN 


C.474C 


72 


HE AO rcN 


0^e042 


73 


PlkO CiM 


0^4140 


74 




C^7324 


n 


tEAO COH 


0^2214 


76 


FEAD C0>4 


0^2ve4 


77 


htko ten 


0^£244 


74 


PFAO CCH 


0^4006 


7$ 


fifAtI CC*i 


€•£430 


40 


fCAD COM 


0^4441 


41 


i*EAO CCM 


0^I424 


42 


MAO CM 


3 •2443 


43 


liCAD CON 




44 


lif AO CON 


C^226l 


44 


HEAD CCN 


0^U07 



0^4274 
0^4402 
0^7440 
0^4377 
0^4244 
0^|247 
0^2341 
0^I274 
0^I4I5 
0.1403 
0^4344 
C^4447 
0^4744 
0^#934 
0^2244 
0^1450 
0^4412 
0^3744 
0^S724 
0^4322 
0.1743 
0^2741 
0^2414 
€•2^44 
0^1744 



0^0043 
<>0^021l 
-0.1444 
-C^0244 
-€•0437 

0^0044 
-0^0144 

o^om 

0^0073 
0^CI40 
0^0401 
0^0444 
0^0494 
0^094« 
-0^0070 
O^OIU 
0^0493 
0^0247 
0^0704 
0^O924 
0^0C42 
-0^03S4 
-C^0319 
-0.0234 
-0^0197 



^•0299 
0^0344 
C^17«2 
0^C«4i 
C^CT79 
(i^0334 
C«C434 
0.0449 
0^03 94 
€•0373 
€•0944 
0.0979 
0^0944 
0^0440 
0^03U 
0^0393 
0^0T32 
^•0344 
0^0747 
^•0420 
0^0317 
C^C440 
0^0427 
^•0344 
0^C393 



0^ 4424 

0.4910 

0^^111 

0^e447 

0.5349 

0^1349 

0^249& 

0 1242. 

0^1444 

0^1494 

0^4927 

0^9702 

0^4464 

0^7047 

0^2344 

0^20n7 

0^ 974f 

0^ 3402 

C^4II44 

0^4437 

€•14^0 

0^2444 

0^2920 

0^2402 

0^1413 



                    No. of    % CORRECT    EST % CORRECT   DIFF % CORRECT     RMWSD      % CORRECT BASE GROUP
Item Type           Items      X    SD       X    SD          X    SD        X    SD        X    SD

All Verbal            85      .45  .24      .45  .24         .00  .04       .05  .02       .46  .25

Vocabulary            45      .46  .26      .46  .26         .00  .04       .05  .03       .47  .27
  Antonyms            25      .50  .25      .49  .25         .01  .04       .05  .02       .50  .25
  Analogies           20      .41  .26      .43  .27        -.02  .05       .05  .03       .44  .28

Reading               40      .44  .23      .44  .22         .00  .04       .05  .02       .45  .23
  Sentence Comp.      15      .56  .21      .56  .22        -.00  .04       .05  .02       .57  .22
  Reading Passages    25      .37  .21      .36  .19         .01  .03       .05  .01       .37  .19








more likely to appear among antonyms or analogies item type groupings than in 
the vocabulary items as a whole or the entire verbal test. 

The next two rows of Table 2 list the means and standard deviations of
the five indices across the antonyms and analogies item types, respectively.
The values of RMWSD_f are still of approximately the same size as before. The
magnitudes of D_f are slightly larger than before, yet still small in an
absolute sense.

The statistics for reading, the other section of the verbal test, and for the
two item types that compose it, sentence completion and reading comprehension,
are posted in the last three rows of Table 2. None of these indices exhibits a
disconcerting amount of unexpected differential item performance.

Even if the overall level of unexpected differential item performance
in a set of items is tolerable, there may be a small number of items which
exhibit substantial unexpected differential item performance that is not readily
detectable from the means and standard deviations of discrepancy indices such as
RMWSD_f and D_f. For an item-level analysis, careful examination of the frequency
distribution of a discrepancy index such as RMWSD_f can be informative. A combined
numerical/pictorial display of the frequency distribution of the RMWSD_f
index on all verbal items, grouped by subscore and by item type, is presented in
Figure 5. The floating histogram in Figure 5 is a clear presentation of the
RMWSD_f index that can be used to identify individual items that exhibit
unusually high amounts of unexpected differential item performance. Note how
the single analogy item with an RMWSD_f of .18 clearly stands out in this figure.
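
A display of this kind can be approximated in a few lines. The sketch below assumes each item is tallied at its RMWSD value rounded to the nearest hundredth, with one plotting symbol per item. The function names are illustrative, and apart from the outlying analogy item's value of .1792 reported in the text, the item values are hypothetical.

```python
from collections import Counter, defaultdict

def tally(items):
    """Count items per item type and per RMWSD value expressed in hundredths."""
    counts = defaultdict(Counter)
    for item_type, rmwsd in items:
        counts[item_type][int(round(rmwsd * 100))] += 1
    return counts

def floating_histogram(counts, symbols, top=20):
    """Build one text row per hundredth, highest value first, one symbol per item."""
    rows = []
    for level in range(top, -1, -1):
        cells = [symbol * counts[item_type][level] for item_type, symbol in symbols.items()]
        row = f".{level:02d} | " + " ".join(cell.ljust(8) for cell in cells)
        rows.append(row.rstrip())
    return rows

symbols = {"antonym": "O", "analogy": "A", "sentence completion": "S", "reading": "R"}
items = [
    ("antonym", 0.031), ("antonym", 0.044), ("analogy", 0.038),
    ("analogy", 0.1792),  # the outlying analogy item reported in the text
    ("sentence completion", 0.052), ("reading", 0.036), ("reading", 0.041),
]

rows = floating_histogram(tally(items), symbols)
for row in rows:
    print(row)
```

An item far from the bulk of the distribution, such as the analogy item at .18, appears alone on its own row well above the others.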



Figure 5

Numerical and Pictorial Display of Frequencies of Root Mean Weighted Squared Differences (RMWSD)
Between the Conditional Probabilities of Success for Female and Male Candidates on
Verbal Items from Form ZSA5 Administered in December 1977

[Left panel: numerical frequencies grouped by item type. Right panel: floating
histograms by item type. Vertical scale: values of RMWSD from .00 to .20.]

          VERB   VOCAB   ANTM   ANAL   READ   SNCP   RDCP
Mode      .04    .03     .03    .03    .04    .035   .04
Mean      .05    .05     .05    .05    .05    .05    .05
S.D.      .02    .03     .02    .03    .02    .02    .01

Legend:

Item Type                   No. of Items   Abbreviations
Verbal Score                     85        (VERB)
  Vocabulary Subscore            45        (VOCAB)
    Antonyms                     25        (ANTM) (O)
    Analogies                    20        (ANAL) (A)
  Reading Subscore               40        (READ)
    Sentence Completion          15        (SNCP) (S)
    Reading Comprehension        25        (RDCP) (R)



An alternative pictorial representation of the distribution of this index
that conveys even more information is given in Figure 6. In this figure, where
each item type is denoted by a different symbol, the RMWSD_f for an item is
represented by the length of the line from the origin to the point representing
that item. To supply a frame of reference, three arcs of equal RMWSD_f are drawn
on the plot for the values .08, .16, and .24. Items falling within the smallest
arc exhibit a fairly typical amount of RMWSD_f. Items falling between the
smallest and middle arcs should be examined more closely. Items falling outside
the middle arc are very unusual and clearly exhibit a large amount of unexpected
differential item performance.
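
The frame of reference supplied by the arcs amounts to a simple screening rule. A minimal sketch follows, using the .08 and .16 arc values given above. The function name, the category labels, and the treatment of values falling exactly on an arc are assumptions made for illustration.

```python
def classify_item(rmwsd, inner_arc=0.08, middle_arc=0.16):
    """Place an item relative to the reference arcs of equal RMWSD."""
    if rmwsd < 0:
        raise ValueError("RMWSD is a distance and cannot be negative")
    if rmwsd <= inner_arc:
        return "typical"
    if rmwsd <= middle_arc:
        return "examine more closely"
    return "very unusual"

# .1792 is the value reported for the outlying analogy item; the others are hypothetical.
for value in (0.04, 0.10, 0.1792):
    print(f"{value:.4f}: {classify_item(value)}")
```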

As described earlier, the RMWSD_f for each item can be expressed as the
square root of the sum of two additive components: the square of a constant
directional discrepancy, which is D_f, and an index of residual crossover, i.e.,
a sum of weighted squared differences in conditional probabilities after
correction for the constant difference, which is referred to as the variance of
the weighted differences. (See equation (4).) Projection of each point in Figure 6
on the horizontal axis yields D_f, the difference between P_f and P̂_f, for that item.
Projection of that same point on the vertical axis yields the standard deviation
of the weighted differences, the index of residual crossover. Hence, the
location of each point in Figure 6 indicates not only the degree of unexpected
differential item performance (RMWSD_f), but also the extent to which that
RMWSD_f is due to a constant difference between the female and male conditional
probability curves (and the direction of that difference: D_f), and the extent
to which the item exhibits residual crossover, the height of the point above
the horizontal axis.
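
The geometry described above can be sketched as follows, assuming the decomposition takes the form RMWSD^2 = D^2 + (standard deviation of the weighted differences)^2, as the description of the two additive components suggests. The function names and the numerical values are illustrative only.

```python
import math

def residual_crossover(rmwsd, d, tol=1e-12):
    """Standard deviation of the weighted differences implied by RMWSD^2 = D^2 + SD^2."""
    gap = rmwsd ** 2 - d ** 2
    if gap < -tol:
        raise ValueError("RMWSD cannot be smaller than the absolute value of D")
    return math.sqrt(max(gap, 0.0))

def plot_position(rmwsd, d):
    """(horizontal, vertical) coordinates of an item in a plot of this kind."""
    return d, residual_crossover(rmwsd, d)

# Hypothetical item with RMWSD = .05 and D = -.03.
x, y = plot_position(0.05, -0.03)
print(f"horizontal (D): {x:.2f}, vertical (residual crossover): {y:.2f}")
print(f"distance from origin: {math.hypot(x, y):.2f}")
```

Under this reading, an item lying close to the horizontal axis owes nearly all of its RMWSD to a constant difference, while an item lying high above the axis with D near zero owes it to crossover.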

The analogy item depicted in Figures 3 and 4 is the only verbal item which
falls outside the second arc of Figure 6. It is also the item in Figure 5 that
is off by itself in the floating histogram at the top, where it has an RMWSD_f of
.1792. Clearly this index indicates a highly undesirable amount of unexpected
differential item performance for this analogy item.

Figure 6

Plot of Root Mean Weighted Squared Differences (RMWSD*) Between
the Conditional Probabilities of Success for Male and Female
Candidates on Verbal Items from SAT Form ZSA5

[Plot: ITEM DISCREPANCY INDICES, FEMALE. Horizontal axis: DIFF % CORRECT.
Plotting symbols distinguish ANTONYM, SENT COM, READ COM, and ANALOGY items.]

*RMWSD equals the distance from the origin to the point representing the item.
Projection of each point on the horizontal axis yields the difference between
P_f and P̂_f, D_f, for that item. Projection of each point on the vertical axis
yields the standard deviation of the weighted differences, an index of residual
crossover.

In Figure 6, the analogy item outside the second arc is just above .05 on 
the vertical axis and at approximately -.17 on the horizontal axis. Hence, this 
item is exhibiting little residual crossover, and a very sizeable amount of 
constant difference. Examination of Figure 4 corroborates these observations. 
This analogy item exhibits a substantial constant amount of unexpected differ- 
ential item performance. 

In contrast to this item, most of the items fall within the first arc, which
indicates that most of the items, 80 out of 85 in fact, exhibit acceptable levels
of unexpected differential item performance. Of the four that fall between the
inner and middle arcs, an antonym item that has a positive D_f and an analogy
item with a negative D_f are close enough to the inner arc to be considered as
exhibiting acceptable levels of unexpected differential item performance. The
remaining two items, a sentence completion item and an antonym, however, merit
some careful examination. Like the analogy item outside the middle arc, these
two items have negative D_f values, which indicate that female candidates
performed more poorly than expected on these items.

On the analogy item that lies outside the middle arc, female candidates
performed far worse than expected: P_f = .63 vs. P̂_f = .80. Inspection of the
content of this particular analogy item revealed potential content bias against
female candidates, as it required some knowledge of hunting and fishing, two
traditionally male-oriented recreational activities.

On the sentence completion item, female candidates performed somewhat
lower than expected: P_f = .40 vs. P̂_f = .50. Inspection of the item itself
revealed that the subject matter of the item, nuclear power politics, might be
something that males traditionally have shown more interest in than females. It
is not apparent, however, why this particular subject matter should affect
the performance of female candidates on this item.

Finally, on the antonym item, female candidates performed below expectation:
P_f = .72 vs. P̂_f = .80. Examination of item content, however, provided no
plausible explanation for this difference.

In sum, this analysis of the 85 verbal items on Form ZSA5 uncovered only
one item that exhibited a substantial amount of unexpected differential item
performance that probably could be attributed to item content. Only two other
items exhibited enough unexpected differential item performance to merit
examination. Most of the 85 verbal items exhibited little unexpected differential
item performance for female candidates.

SAT Mathematical Results

Table 3 contains listings of the five indices, P_f, P̂_f, D_f, RMWSD_f, and P_m,
for the 60 mathematics items on Form ZSA5. In addition, these indices are
summarized by item type at the bottom of this table. The first row at the
bottom of Table 3 contains means and standard deviations based on 59 mathematics
items. One math item was excluded from this analysis because the percent of
female candidates responding correctly to the item was less than .05.
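
The exclusion rule and the summary rows can be sketched as below. The .05 cutoff is the one stated above. The function name, the dictionary layout, the item values, and the use of the sample standard deviation are assumptions for illustration, since the report does not state which form of the standard deviation was computed.

```python
from statistics import mean, stdev

def summarize(items, index, min_p_f=0.05):
    """Mean and SD of one index over items whose female percent correct is at least min_p_f."""
    kept = [item[index] for item in items if item["p_f"] >= min_p_f]
    return len(kept), mean(kept), stdev(kept)

# Hypothetical items; the third falls below the .05 cutoff.
items = [
    {"p_f": 0.62, "rmwsd": 0.03},
    {"p_f": 0.35, "rmwsd": 0.05},
    {"p_f": 0.04, "rmwsd": 0.07},  # below .05, so left out of the summary
]

n, avg, sd = summarize(items, "rmwsd")
print(f"items kept: {n}, mean RMWSD: {avg:.2f}, SD: {sd:.3f}")
```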

Unlike verbal test results, mean P_f (.42) for female candidates and mean
P_m (.51) for male candidates are very different, reflective of the difference
between the mathematical ability distributions for males and females, and






Table 3 



Listing of Item Difficulty and Discrepancy Indices and Summary
Statistics for Mathematical Items from SAT Form ZSA5



ITEM #  ITEM TYPE  % CORRECT    EST % CORRECT




1 


fiEG '^A iH 


U* ? w43 


U . f vie 


2 


PfG HATH 


0*55^2 


V . C 6 9Y 


3 


l^tG MATH 


0« 5 J7T 


W . 9 V 9D 


4 


(ES MITH 


C* 6456 


Q. c 1 OS 


5 


P^G HATH 


0*4 365 


o . 9c vm 


i 


F G HATH 


D • c r 4 1 


n * 1 T1' 

*i . C 1 ff 1 




FrG HATH 




V.C f 9^ 


• 


pcg hath 


0*^966 






r E V* H 4 T H 


A 1 9A 


S491 
V . 9^C 1 


10 


FCG HATH 


' * • % c ? B 


. «l . ^ c r U 


11 




C*4£ 93 


w . % r Oc 


12 


rtG hath 


? • % f 0 ff 


n 41(1 1 

U. ^ 91 1 


-13 


F^G HATH 


K Q C ^ 


0 .5581 


14 


^eg math 


O ^11' 


0. 3949 


15 


HATH ' 


V • *t T f 


C^.4i^d9 


16 


^EG HATH 


A )9CT 
V • ^ r 


0. 2995 


17 


I«EG H4TH 


o • 1 *t c U 


^•1672 


19 


Feb I'ATH 


' J • 1 % f 1 


0.1941 


1% 


PtG HATH 


n 9 K^A. 


0.2403 


2C 


rrG HaIH 


> "^oT ^ 
Jm %'*f f V 


f* - ' 09fl 


21 


Ft u n A 1 n 


1 • Ic OS 


C . 1 T27 


72 


{•E G HATH 


A 1 laT 


0.1114 


23 






0.0821 


2* 


( F G HA TH 


U • W > * c 


0.1121 


25 


PLw HAIH 






26 / 


RCG HATH 


V • 0 ^ V 1 


3.91 50 


2T 


^eg hath 


/ \Jm 9 J ^ r 


0. 5903 


26 


C C ^ fej A V 4J 

h c u "^4 1 n 


*^ fit 
•/•CI r ^ 


^. 9C55 


2S 


rtvi hath 


Ua 09£ Y 


0.6 951 


30 


PEG HATH 


'J • C c re 


r. 6 561 


31 


PCi* hath 


V m 9> 1 1 


0.5856 


32 


t r f* MA Y 


V* 9 t O 


C . 5 6 03 


99 


WW 4i« 1 C^r 


0 • 66 7*4 


»^.6 639 


J* 




0*6594 


0.5790 


35 




Da 6669 


C. 66?8 


36 


i A(>Li Tf 




U.5960 


37 




0«7193 


C^7740 


38 


gjAfvTCKP 


0.4279 


r.4l52 


39 


ojantchp 


0*5955 


0.5432 


4C 


OUAMTCHP 


0.5792 


0.5963 


41 


OUAhTCMR 


0.3825 


J.6C35 


42 


0JAr4TCHP 


0.5061 


0.5364 


43 


OjAr;TCHP 


C*5?C3 


0.5281 


44 


ouantckp 


C.4110 


0.4040^ 


45 


OJANTCHP 


0.46 00 


0.45^6 - 


41 


CJA\TCHP 


t.3747 


C.3962 


4T 


OUANTCHP 


0.1866 


0.1943 


48 


QUANT CMP 


0.2591 


0.2483 


49 


cjantcmp 


0.3335 


C.3175 


50 


OUANTCHP 


0.1461 


0.1775 


51 




0.1615 


0.1986 




CJANTCHP 




0.2149 


V 53 


PtQ NAtH 


C.5176 


0.4521 ' 


54 


fi|5 MATH 


0.1826 


C« 1 767 


55 


fcEC MATH 


6.?214 


0.3803 


56 


ItLC.MATH 


0.C875 


0.i:»42 


5T 


f^(G MATH 


0.1543 


^•1513 


58 


tE^ MAtH 


0.C991 


0.1068 


^\ 


HEw MATH 


O.C782 


0.C872 


60* 


8EC MAtH 


0.0490 


0.0532 




Vo. of 




X SD 


lc«« Typ« 




X SD 



Co«p«ri«on 



59 
20 
39 



•42 



.22 



•45 .10 
.40 .24 



.42 
.45 
•41 



.21 
•16 
•23 



DIFF % CORRECT

€•0528 
-0.0756 
-0.0279 
" 0.0289 

-0.0335 
0.C569 

C.C0C6 
--0.0465 

0.07C3 
-0.0012 ' 
-0.0099 

0.027* 

0.0408 

0.0168 
-U.0332 

0.0302 
-0.0251 
-0.0370 

0.C143 

0.0':42 
-0.0441 
-0.0027 

0.0027- 
-0.0184 
-C.C046 

0.0051 
-r.Cf96 

' -3.0422 
-3.C288 
0.C055 
O.OlAiu 
-0.COJ5 
0.5904 
0.0231 
t.U230 
-0.0557 
0.0127 
0.C553 
-0.0180 
-O.U210 
-C.0303 
0.0022 
C.0C7C 
0.0004 
-0.0114 
-0.0077 
0.0108 
0.0160 
-0.0313 
-0.0370 
-0.0099 
0.0652 
C.0U99 
-0;0589 
'•O.Ol*? 

0.U027 
-0.0177 
-0.0089 
«>0.0042 




Q aotul«r Iteth 

FRIC 

UBHrajxcliidsd frcm ouMutry Htotiocico bocAiMM i«88 than 



.00 
.00 



.03 
.03 



RMWSD

0.0276 
0.0863 

0.C420 

0.C405 

0.048C 

0.0675 

0.0254 

0.0603 

0.C926 

0.0346 

0.0345 

0.0423 

C.0584 

0.C338 

€.0448 

0.0553 

C.0395 

0.0454 

0.0449 

0.0214 

0.0570 

0.^244 

0.0195 

0.0338 

C.C224 

C.0300 

0.07<»1 

O.C249 

C.0546 

0.0470 

0.0269 

C.0330 

0.C265 

G.C984 

0.0375 

0.0413 

0.0665 

040308 

C.0693 

0.C323 

0.0390 

0.-0443 

0.0296 

0.0289 

0^0312 

0^0258 

0^03 69 

C^0330 

0^0425 

€•0417 

0.0752 

0.0289 

0.0721 

0.0209 

0.0701 

0.0258 

0.3354 

0.0294 

0.C264 

O.0177 



X 

.04 
•04 
•04 



SD 

• 02 

•02 

• 02 



% CORRECT BASE GROUP

0^7569 
0^7261 
0^6879 
0^7029 
0.635) 
0.7134 
C.7774 
0.6629 
0.6616 
0^5209 
0.5721 
C.56 71 
0.6468 
0.4998 
0.5727 
0.3816 
0.2513 
0v2570 
r.3223 
0.1354 
0.2425 
0.1661 
0.1226 
0.1593 
C.0957 
0.8776 
' C.6898 

0.9696 
0.7918 
0.7315 
0.6818 

0.6374 

J. 7609 

0.6537 

0.7345 

0.6971 

0.8499 

C.5375 

0.6248 

C.6997 

0.6922 

0.6377 

0.6306 
" 0.4987 

0.5781 

0^4980 
" 0^2842 

0.3284 

0.3499 

0.2363 

0.2635 

0.2704 

C.5515 

0.2e98 

0.4690 

0.1735 

0.2099 

0^14e8 

0.113a 

0.0788 




.54 
• 49 



.19 
.24 



.05. 




illustrative of the need to correct for this difference prior to comparing
male and female item performance. Note that mean P̂_f (.42), in contrast to mean
P_m, is very close to mean P_f, demonstrating the effectiveness of the
standardization procedure in this regard. Both D_f and RMWSD_f have very low means,
indicating little overall difference, as expected, between the sexes on the
items.
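
Why a raw comparison of percent correct can mislead, and how standardization removes the effect, can be shown with a constructed example. The sketch assumes that the estimated female percent correct is obtained by applying the male conditional probabilities of success at each score level to the female score distribution. All names, counts, and probabilities are hypothetical, and the weighting used in the report itself may differ.

```python
def percent_correct(counts, p_conditional):
    """Overall proportion correct implied by score-level counts and conditional probabilities."""
    total = sum(counts)
    return sum(n * p for n, p in zip(counts, p_conditional)) / total

# Four hypothetical score levels, low to high. The item behaves identically
# for both groups at every level; only the score distributions differ.
p_male_by_level = [0.20, 0.40, 0.60, 0.80]
p_female_by_level = [0.20, 0.40, 0.60, 0.80]
n_male = [10, 20, 40, 30]
n_female = [30, 40, 20, 10]

p_m = percent_correct(n_male, p_male_by_level)        # observed, male base group
p_f = percent_correct(n_female, p_female_by_level)    # observed, female group
p_f_hat = percent_correct(n_female, p_male_by_level)  # expected for females, from male rates
d_f = p_f - p_f_hat

print(f"P_m = {p_m:.2f}, P_f = {p_f:.2f}, estimated P_f = {p_f_hat:.2f}, D_f = {d_f:.2f}")
```

In this constructed case the observed group means differ substantially, yet the standardized difference is zero because the item functions the same way for both groups at each score level.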

The next row of Table 3 displays the means and standard deviations of the
five indices computed on the 20 quantitative comparison items. Female
candidates' mean percent correct is extremely close to their estimated mean (i.e.,
mean D_f = .00). The mean value of RMWSD_f is only .04.

The last row of Table 3 presents the data for 39 regular math type items.
Item #60 was excluded from the analysis because the female candidates' percent
correct on this item was less than .05. These means and standard deviations
suggest that little unexpected differential item performance is present.

Figures 7 and 8 contain pictorial and numerical displays of the discrepancy
indices for both quantitative comparison and regular mathematics item types.
Neither the floating histogram in Figure 7 nor the plot in Figure 8 reveals
any items that exhibit the substantial degree of unexpected differential item
performance observed for the one analogy item in the verbal test. Only two
items, in fact, fall outside the inner arc in Figure 8. Female candidates
performed better than expected on one item, but more poorly than expected on the
other item. The plots of male and female conditional percents correct and the
difference plot for the former item are given in Figures 9 and 10, respectively,
while Figures 11 and 12 are the corresponding plots for the latter item. Note
that Figures 9 and 11 appear to be mirror images of each other, with female
candidates slightly exceeding male candidates in Figure 9, while the reverse
occurs in Figure 11. Both figures exhibit fairly constant differences, but
in opposite directions. Examination of the content of these two items provided
no apparent explanation for these differences. Hence, it appears that all
mathematics items on Form ZSA5 are relatively free from unexpected differential
item performance for females, despite the fact that the mean scaled
score for female candidates was approximately one-half a standard deviation
lower than the male candidate mean scaled score. The standardization procedure
effectively adjusted for this difference in overall performance.

Figure 7

Numerical and Pictorial Display of Frequencies of Root Mean Weighted Squared
Differences (RMWSD) Between the Conditional Probabilities of Success
for Female and Male Candidates on Mathematical Items from Form ZSA5
Administered in December 1977

[Left panel: numerical frequencies grouped by item type. Right panel: floating
histograms by item type. Vertical scale: values of RMWSD from .00 to .20.]

          Mathematical   Regular       Quantitative
          Score          Mathematics   Comparison
Mode      .03            .03           .03
Mean      .04            .04           .04
S.D.      .02            .02           .02

Legend:

Item Type                   No. of Items   Abbreviations
Mathematical Score               60
  Regular Mathematics            40        (R)
  Quantitative Comparison        20        (Q)

Figure 8

Plot of Root Mean Weighted Squared Differences (RMWSD*) Between
the Conditional Probabilities of Success for Male and Female
Candidates on Mathematics Items from SAT Form ZSA5

[Plot: ITEM DISCREPANCY INDICES, FEMALE. Horizontal axis: DIFF % CORRECT.
Plotting symbols distinguish QUANTITATIVE and REGULAR MATH items.]

*RMWSD equals the distance from the origin to the point representing the item.
Projection of each point on the horizontal axis yields the difference between
P_f and P̂_f, D_f, for that item. Projection of each point on the vertical axis
yields the standard deviation of the weighted differences, an index of residual
crossover.

Figures 9 and 11

Conditional Probability of Successful Performance for Both
Males and Females on Two Math Items from SAT Form ZSA5

[Two panels: Item 2 and Item 9. Horizontal axis: SCALED SCORE. Plotting symbols
distinguish FEMALE and MALE candidates.]

Figures 10 and 12

Difference Plot of Two Math Items from SAT Form ZSA5

[Two panels: Item 2 and Item 9. Horizontal axis: SCALED SCORE. Plotted values:
FEMALES - MALES, annotated with DIFF % and RMWSD for each item.]

TSWE Total Test and Item Type Results 

Table 4 contains a listing of the five indices, P_f, P̂_f, D_f, RMWSD_f, and P_m,
discussed in preceding sections, for the 50 TSWE items on Form E11. In addition,
these indices are summarized by item type at the bottom of the table. The first
row at the bottom of Table 4 contains means and standard deviations based on all
50 TSWE items, and the next two rows contain the same information for the 35
usage type items and the 15 sentence correction items, respectively. Estimated
percent correct (P̂_f) means for the female candidates are very close to actual
(P_f) means across both item types, combined and separately. The mean values of
RMWSD_f are similar to those observed for the mathematical items. No mean
differences appear large enough to warrant further consideration.

Figures 13 and 14 contain pictorial and numerical displays of the discrepancy
indices for all TSWE items on Form E11. Inspection of these figures
reveals that only two usage items exhibit any substantial amount of unexpected
differential item performance. Performance on these items is depicted in
greater detail in Figures 15-18. The female candidates performed better than
expected (P_f = .59 vs. P̂_f = .50) on the item displayed in Figures 15 and 16.






Table 4 



Listing of Item Difficulty and Discrepancy Indices and
Summary Statistics for TSWE Items from Form E11



ITEM #  ITEM TYPE  % CORRECT    EST % CORRECT



1 




0*9214 




2 


USA'iE 


O.tOf 0 


0^il62 


3 


USAGE 


C*7P46 


0» 7992 




USACL 


i • 9P93 


r •^V r 1 


9 


USAwE 


0*69$2 


0.6724 


« 


US^CE 


0*7*19 


V • mum 


7 


USAGE 


0«t918 


^•1 f V% 


• 


USAGE 


0* 3234 


O^ J JwS 




USAGE 


C •6329 


!!• 7YcT 


IC 


USAGE 


C»544^ 


0^ 97 Jl 


11 


USAGE 


C tflYftS 


Vm w r9V 


17 


USAGE 


C« 7914 


W • r 


13 


USAGE 


0* P479 


U. 0 4Cc 


14 


U^ AGE 


C •9?9l 


• F 9 C 


19 


USAifl 


C« 9C94 


O^ 99 J9 


le 


USAGE 


C»?76P 


" • Z 9^9 


17 


USAGE 


f •5772 


r 4aca 

• ' • w^^m 


It 


USAGE 


C •( 233 


(!• C c vc 


19 


USAGE 


C •494t 


V •%99 J 


?3 


US*Gf 




«• 7C WS 


21 


USA'VF 


0* 93^1 


V. P 1 C Y 


22 


USAGE 


0«4944 


!#• 7 S^C 


23 


USAiffE 


r • 9 J ^ 1 


V • 9*C 1 


24 


USAGE 




C. 9930 


29 


USAGE 


o.44ro 


0.4269 


2e 


SfNI CCR 


C.920t 


€•9192 


27 


S6ni rcf 


d»7739 


f ^l'9 


2t 


SENT CC> 


0.7279 


0^7333 


29 


ScM COM 


0.9726 


^•9619 


3t 


Sf SI CCP 


V* Til re 


^•7977 


31 


sekt cop 


C.e«i79 


€•7169 


32 


St ST CC^ 


C«e4?3 


(*• 


33 


S£S1 iCh 


1.6411 


^•6320 


J* 




C «4C 12 


0^4429 


39 


SCS» CCfc 


C.fc497 


0^*931 


3* 




C.71C7 


r*,t939 


37 


SENT coil 


c.9re7 


0^6039 


3S 


SFS1 CC» 


C.9393 


^•9399 


3« 


SFN1 COP 


0.?t24 


0^4292 


4C 


sEsi crp 


C«44?3 


'^•4349 


41 


USAGC 


C.I424 


C^93d3 


42 


USAGE 


C.44(l 


0^4993 


43 


USAGE 


0.7641 


(•79^4 


44 


USAGfe 




^•4927 


49 


USAGE 


0.63V3 


^•A249 


4t 


USAGE 


0.6921 


^•6374 


47 


USAtfE 


0«9ft29 


0^9909 


48 


USAGE 


Q.6691 


^6709 


4$ 


USAGE 


0.6139 


0^9732 


M 


USAGE 


0.3191 


0^2999 




«o. Of 




» IP 


ItM Tjpf 


XtMU 


X IP 


TSUI 


SO 




•44 .14 




35 


.44 .li 


.44 •U 


StntMic« 


13 


• 45 .15 


.46 .14 


Corr«ctloa 





DIFF % CORRECT  RMWSD        % CORRECT BASE GROUP



•©•Ov99 


C 0149 

w • V 19C 


\ C«9191 




0«0299 


\ 0^ 9020 


V • VI J V 


0,0317 


Ao^7924 


V •«#■■! 


' 0^0974 


^0^4(i94 


a 099 ■ 
!#• Vcc ■ 


0 ^0339 


0.4993 


«'>^C9A7 


0,0373 


0^7794 


O O 1 94 
!!• V Ic 9 


(,0240 


0* 9690 


"0^0072 


C ^0372 


0^3224 


0,0402 


0^C943 


€•9919 


«Q,Q2|I2 


0«04l4 


0^5991 


!!• Vw^T 


0«0297 


0^7409 




0,0293 


9^7710 


0,0117 


0^0229 


0*9249 


0, 0040 


0^0263 


0^793« 


.0^0939 


€•0929 


0^9791 


0«i)029 


0^9299 


0^3622 


-r>,0224 


0^0392 


€•9799 


0,0031 


0^0271 


€•4067 


.(^•0109 


0^0319 


, ^•4919 


-0^0191 


0^0340 


0^9639 


0«0292 


€•0399 


0^9CC9 


-0^0397 


0*0493 


0^9199 


C«0030 


0^0294 


0^9199 


0«0140 


C^0290 


0^ 9473 


0^0131 


0^C393 


0^4119 


0^0094 


0^0194 


C^9r73 


C^?330 


0^0379 


0.7269 


*0^0999 


0^0394 


€• 7213 


0^0112 


C^0299 


0^9911 


-Cr^ClOl 


C^C2C3 


€•7999 


*0^0199 


€•0310 


C^70C9 


*W^P0J9 


V • Wl f 9 


0«4219 


0^909l 


0^0262 


0^4143 


-^•0013 


0^0291 


0^4AK 


-0^0434 


0^0999 


0^4771 


0^0173 


€•0323 


0^4742 


•^•^l 7c 


r •f)334 


0^9933 


C*3039 


0^0290 


0^9164 


*0^0697 


0^0712 


€•4079 


0^0099 


^•0279 


0^4U4 


0^0121 


0^0213 


€•9133 


-0^0092 


0^0249 


0^4394 


0^0199 


0^0799 


0^7299 


0^C492 


0^9949 


€•4372 


0^0119 


€•0349 


€•6997 


0^0199 


0^C397 


0^6132 


0^0019 


0^C799 


€•^594 


C^044l 


0^C911 


0^9973 


0^0407 


€•0943 


€•9943 


0^0142 


€•0299 


€•2939 


X ID 




1 SP 


•00 ^03 


.04 •oa 


.43 .14 


•01 .03 


.04 .02 


.42 •IIT 


'.01 .02 


• 03 .01 


.44 .15 






Figure 13

Numerical and Pictorial Display of Frequencies of Root Mean Weighted Squared
Differences (RMWSD) Between the Conditional Probabilities of Success
for Female and Male Candidates on TSWE Items from Form ZSA5/E11
Administered in December 1977

[Left panel: numerical frequencies grouped by item type. Right panel: floating
histograms by item type. Vertical scale: values of RMWSD from .00 to .20.]

          TSWE                Sentence
          Score     Usage     Correction
Mode      .03       .03       .03
Mean      .04       .04       .03
S.D.      .02       .02       .01

Legend:

Item Type                             No. of Items   Abbreviations
Test of Standard Written
English Score                              50        TSWE
  Usage                                    35        (U)
  Sentence Correction                      15        (C)






Figure 14

Plot of Root Mean Weighted Squared Differences (RMWSD*) Between
the Conditional Probabilities of Success for Male and Female
Candidates on TSWE Items from Form ZSA5/E11

[Plot: Item Discrepancy Indices. Horizontal axis: DIFF % CORRECT, marked from
-0.20 to 0.05. The vertical axis label is illegible in the source scan. The
legend identifies Sentence Correction items; the remaining legend entries are
illegible.]

*RMWSD equals the distance from the origin to the point representing the item.
Projection of each point on the horizontal axis yields the difference between
$P_f$ and $\hat{P}_f$ for that item. Projection of each point on the vertical axis
yields the standard deviation of the weighted differences, an index of residual
crossover.






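Conditional Probability of Successful Item Performance
for Both Males and Females on Two
TSWE Items from Form ZSA5/E11

Figure 15. Item 4: conditional probability of success by scaled score, plotted
separately for female and male candidates.

Figure 16. Item 4: differences in conditional probability (females minus males)
by scaled score. RMWSD = 0.0974; the DIFF % value is illegible in the source scan.

Figure 17. Item 15: conditional probability of success by scaled score, plotted
separately for female and male candidates.

Figure 18. Item 15: differences in conditional probability (females minus males)
by scaled score. The DIFF % value is negative; its digits are illegible in the
source scan. RMWSD is approximately 0.092 (final digit illegible).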






Most of this difference is constant across levels of scaled score. On the item
displayed in Figures 17 and 18, the female candidates did not perform as well as
expected ($P_f = .51$, $\hat{P}_f = .59$). Again, most of the difference is in one
direction. Note that these two items appear to cancel each other out.

Examination of the content of these two items revealed that the item on 
which females performed better than expected concerns a woman in a professional 
occupation, while the item on which females fell short of expectation deals 
with World War II, which is generally considered an area that males study 
more than females. However, these content differences do not appear to be 
sufficient explanations for the discrepancies in the observed and expected 
performance of female candidates on these items. 


Summary 

This report was the first in a series of investigations seeking to uncover
evidence relating to the presence or absence of unexpected differential item
performance on operational SAT/TSWE items across different candidate
subpopulations of the SAT/TSWE test-taking population via the statistical method of
standardization. The use of standardization enables one to control for
differences in subpopulation ability. Standardization is a reasonable procedure for
controlling for differences in ability, provided the control variable is a
reasonable measure of ability, as is total scaled score.

Examination of summary statistics for discrepancy indices at the item type
level revealed that there was little evidence of systematic unexpected
differential item performance on either the SAT-M or TSWE tests. On the verbal test,
the analogy items exhibited a mean which suggested systematic unexpected
differential item performance that favored the male candidates. Eliminating
the one analogy item which exhibited very substantial unexpected differential
item performance halves the mean for analogy items, from -.02 with that item
included in the set to -.01 without it, suggesting that with the exception
of that one item, the analogy items, as a set, exhibit little unexpected
differential item performance.

In contrast to previous investigations of item fairness (see review by
Dorans, 1982), this investigation of differential item performance identified
very few items out of a total of 195 items as needing careful review for
possible content bias. Of these only one exhibited a clearly unacceptable
degree of unexpected differential item performance that could be attributed
to content bias.

Since this is the first application of the standardization approach to
studies of unexpected differential item performance, future applications are
bound to involve modifications of the method as employed here. Certain
modifications are very likely to occur. For example, different candidate
subpopulations will be studied and, as a consequence, the range of scaled scores studied
may be curtailed. A variation of the standardization procedure that can be used
with small samples may be employed. For some studies, the focus may be shifted
away from breakdowns by item type towards breakdowns by content, where feasible.
In short, the methodology will be refined and adapted to meet the requirements
of future applications.






References 



Alderman, D. L., and Holland, P. W. Item performance across native language
groups on the Test of English as a Foreign Language (RR81-16). Princeton,
NJ: Educational Testing Service, 1981.

Berk, R. A. (Ed.) Handbook of methods for detecting test bias. Baltimore, MD:
Johns Hopkins University Press, 1982.

Carlton, S. T., and Marco, G. L. Methods used by test publishers to "debias"
standardized tests: Educational Testing Service. In R. A. Berk (Ed.),
Handbook of methods for detecting test bias. Baltimore, MD: Johns Hopkins
University Press, 1982.

Cook, L., and Nutkowitz, I. Test analysis: College Board Scholastic Aptitude
Test December 1977 Administration ZSA5 (SR-79-63). Princeton, NJ:
Educational Testing Service, 1979.

Donlon, T. F. The SAT in a diverse society: Fairness and sensitivity.
The College Board Review, No. 122 (Winter 1981-82), 16-21, 30-32.

Dorans, N. J. Technical review of item fairness studies: 1975-1979 (SR-82-90).
Princeton, NJ: Educational Testing Service, 1982.

Hecht, L. W., and Swineford, F. Item analysis at Educational Testing Service.
Princeton, NJ: Educational Testing Service, 1981.

Shepard, L., Camilli, G., and Averill, M. Comparison of six procedures for
detecting test item bias using both internal and external ability criteria.
Journal of Educational Statistics, 1981, 6, 317-375.

Stern, J. Test analysis: College Board Test of Standard Written English
June 1976 Administration E11 (YSA3). Princeton, NJ: Educational Testing
Service, 1977.

Walker, R. C. A reader's guide to test analysis reports. Princeton, NJ:
Educational Testing Service, 1981.






Appendix 

THE STANDARDIZATION APPROACH TO ASSESSING 
UNEXPECTED DIFFERENTIAL ITEM PERFORMANCE 

Since the standardization approach to assessing unexpected differential
item performance represents a new application of an old technique to an important
concern in applied testing, the approach will be presented in detail in this
appendix. First, the rationale for standardization will be discussed. Then,
the particular application of standardization will be described. In the process
of describing this approach to assessing unexpected differential item performance,
several terms and concepts will be defined. The goals of this appendix are:

(1) to convey the simplicity and generality of the standardization approach,
and

(2) to illustrate its application to the assessment of unexpected
differential item performance.

The Need for Standardization

Standardization is a statistical technique that enables one to compare two
populations of individuals with respect to some variable of interest while
controlling for differences on some other variable that is related to the
variable of interest. The best way to convey the meaning and importance of
standardization is to illustrate what may occur when standardization is not performed
when it should be. Simpson's paradox is the designation for a paradoxical
situation in which a population with a higher overall incidence of some variable
than a second population actually has a lower incidence of that variable
than the second population when comparisons of that variable are conditioned on
some other variable. Simpson's paradox (Wagner, 1982) can be used to illustrate
the importance of standardization.






Consider the following illustration. Table 1 contains a statistical
description of the performance of two hypothetical groups, A and B, on an
item. Group A is composed of 100,000 candidates, while Group B is composed of
1,000 candidates. In the body of the table, the performance of the two groups on
the item is summarized at the far right under the column heading overall
performance. Here we note that 60,000 of the 100,000 members of Group A answered the
item correctly, while 500 of the 1,000 members of Group B answered the item
correctly. Since the 60% for Group A exceeds the 50% for Group B, we might
conclude that this particular item favors Group A over Group B. Such an
interpretation, however, would be in error because it ignores important information
about the two groups that is contained in the rest of the table, namely that
Group A is more able than Group B.

To the left of the overall performance column in Table 1 are five columns
of numbers that describe the performance on the item of subgroups of A and B
that are classified into five mutually exclusive performance levels, L1-L5. As
is evident in the % Correct rows of the table, L1 is the least able subgroup, L5
is the most able, and L2, L3 and L4 are ordered from low to high in terms of
performance on the control variable. At each ability level, members of Group A
are as able as members of Group B. Thus, the 35,000 members of Group A at L4 are
as able as the 150 members of Group B at L4.

The numbers in the first and fifth rows of the table identify the number
of individuals in Groups A and B, respectively, at each of the performance levels.
These numbers inform us that overall Group A is more able than Group B, with most
of Group A at levels L4 and L5 and most of Group B at L2 and L3. This substantial
difference in overall ability between Groups A and B affects the summary
information portrayed in the overall performance column, which had led us to conclude
that the item favored Group A over Group B.

                                  Table 1

              Performance of Two Groups of Different Ability
              on an Item that Favors the Lower Ability Group

                                   Ability Level
                                                                Overall
                          L1      L2      L3      L4      L5    Performance

Group A
  No. of Individuals    5000   15000   25000   35000   20000     100000
  % at Level             .05     .15     .25     .35     .20
  Answer Correct         500    4500   12500   24500   18000      60000
  % Correct               .1      .3      .5      .7      .9         .6

Group B
  No. of Individuals     200     350     250     150      50       1000
  % at Level             .20     .35     .25     .15     .05
  Answer Correct          40     140     150     120      50        500
  % Correct               .2      .4      .6      .8     1.0         .5

Note. Entries in the "% at Level" and "% Correct" rows are expressed as proportions.

A closer examination of all the information in Table 1, however, leads us 
to conclude that the item, in fact, favors Group B over Group A. The evidence 
for this conclusion is contained in the fourth and eighth rows of Table 1, 
which contain the percent correct for each of the five ability levels in groups 
A and B, respectively. Note that at each ability level, a larger percentage of
Group B members answer the item correctly than do Group A members of comparable 
ability. This analysis, conditioned on ability level, indicates that this item 
favors Group B over Group A because the probability of successful performance 
on the item is .1 higher for Group B than Group A at each of the five ability 
levels. Simpson's paradox refers to the fact that the analysis conditioned on 
ability level contradicts the analysis based on a simple comparison of overall 
performance of the two groups on the item, i.e., the analysis based on the data 
in the overall performance column of Table 1. 
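The reversal described above can be checked directly from the counts in Table 1. The following is a minimal sketch in Python; the variable names are illustrative assumptions and do not come from the report.

```python
# Counts copied from Table 1; variable names are illustrative assumptions.
levels = ["L1", "L2", "L3", "L4", "L5"]
n_a = [5000, 15000, 25000, 35000, 20000]      # Group A: number of individuals
correct_a = [500, 4500, 12500, 24500, 18000]  # Group A: number answering correctly
n_b = [200, 350, 250, 150, 50]                # Group B: number of individuals
correct_b = [40, 140, 150, 120, 50]           # Group B: number answering correctly

# Overall (unconditioned) proportion correct for each group.
overall_a = sum(correct_a) / sum(n_a)
overall_b = sum(correct_b) / sum(n_b)

# Proportion correct conditioned on ability level.
cond_a = [c / n for c, n in zip(correct_a, n_a)]
cond_b = [c / n for c, n in zip(correct_b, n_b)]

print(f"overall: A = {overall_a:.2f}, B = {overall_b:.2f}")
for level, p_a, p_b in zip(levels, cond_a, cond_b):
    print(f"{level}: A = {p_a:.1f}, B = {p_b:.1f}, B - A = {p_b - p_a:+.1f}")
```

The overall comparison favors Group A, while every conditional comparison favors Group B, which is the paradox the text describes.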

Standardization with respect to ability level removes the paradox in the 
item performance analyses by producing a simple total group comparison, like 
that based on the overall performance column, which is not confounded by 
differences in group ability. Standardization accomplishes this goal by using 
the same standard ability distribution for both groups. 

Definitions 

In the balance of this appendix, the following definitions will be employed 
to designate various subgroups and variables used by the standardization approach 
to the assessment of unexpected differential item performance : 




Variables. There are two types of variables: study and control. The
study variable is the variable of interest, while the control variable is a
variable that is related to the study variable and which must be controlled
while making comparisons of the study variable. In the example under
consideration, performance on the item expressed as percent correct is the study variable
while ability level is the control variable. Since percent correct is related
to ability level, the latter must be controlled for during comparisons of the
former.

Groups. There are three types of groups: study, standardization, and
base. The study group, as the phrase implies, is the group under study. In
any given investigation, there are as many potential study groups as there are
potential subgroups in a population. In actuality, certain subgroups, e.g.,
Blacks, are more likely to be study groups because of concerns about the
relevance of tests for these subgroups.

The standardization group supplies the ability distributions used by the
standardization approach. In any comparison of two groups, three possible
standardization groups immediately suggest themselves: either of the two
groups or a composite of the two groups. While all three of these groups are
based on actual data, the standardization approach is not limited to
standardization groups based on actual data. A hypothetical ability distribution
constructed to suit some desiderata could be used as the standardization group.

The base group supplies the model for the data to the standardization
process. The model for the data expresses the study variable as a function of
the control variable. In assessing unexpected differential item performance,
the model is the expected performance on the item conditioned on ability, i.e.,
the expected probability of successful performance on the item given ability
level. As in the case of the study group, there are as many potential base
groups as there are potential subgroups. A subgroup cannot be both the study
group and the base group in the same analysis, however. To achieve a stable
model for the data, the base group should be as large as possible. To avoid
part-total group contaminations, the base group should be independent of the various
study groups in an investigation.

In investigations of unexpected differential item performance, the model
for the data can be empirical or theoretical. An example of an empirical
model in an investigation of unexpected differential item performance in a Black
study group would be the conditional percent correct in a white base group. If
an adjustment of percent correct for not reached, omits and number wrong served
as the data for the study group, an empirical model of the data would be the
comparable adjusted percent correct observed in the base group. Further
discussion of adjusted percent correct is reserved for the mathematical
formalization presented later in this appendix.

The various models of item response theory (Lord, 1980) are examples of 
theoretical models for the data. This appendix is limited to empirical models 
for the data. 



Mathematical Formalization

The mathematical formulation of the standardization approach to assessing
unexpected differential item performance can be described in several stages,
each of which focuses on a different component. These components are:

  I. Observed Study Group Data
       A. Basic Data
       B. Derived Data to be Modelled
 II. The Model for the Data
III. Definition of the Standardization Group
 IV. Statistical Indices of Unexpected Differential Item Performance

Observed Study Group Data 

In the balance of this appendix, the following indices will be employed:

- g is the subscript for subgroup and ranges from 1 to G, where G is the
  number of subgroups;

- s is the subscript for scaled score or ability level and ranges from 1 to
  S, where S is the number of scaled score levels. For SAT-V and SAT-M,
  S is 61; for TSWE, S is 41;

- r is a response type indicator for which
    1 = correct response
    2 = incorrect response
    3 = omit
    4 = not reached.

Basic Data. The basic data are counts, $N_{gsr}$, i.e., the number (frequency)
of people in subgroup g at ability level s who gave response type r to the item.
For example, $N_{gs1}$ is the number of people in g at ability level s who responded
correctly to the item, while $N_{gs3}$ is the number of people in g at ability level
s who omitted the item. If we let + represent a simple unweighted sum, then
$N_{gs+}$ is the number of people in g at s. In addition, $N_{gs+} - N_{gs4}$ is the number
of people in g at s who reached the item.

Derived Data to be Modelled. Some variation of percent correct constitutes the
data to be modelled for unexpected differential item performance. Simple
percent correct at ability level s in subgroup g is defined as

(1)  $P_{gs} = N_{gs1} / N_{gs+}$.

An alternative percent correct involves a correction for not reached,

(2)  $P_{gs}(NR) = N_{gs1} / (N_{gs+} - N_{gs4})$.

Yet another "adjusted" percent correct entails an adjustment for guessing,

(3)  $P_{gs}(GA) = \left( N_{gs1} - N_{gs2}/(k-1) \right) / N_{gs+}$,

where k is the number of options in the multiple choice question. Choice of
"percent correct" depends on the purposes of the investigation. Various choices,
such as (1) - (3) above, can be obtained as a special case of a general formula
for the data,

(4)  $P_{gs}(W_r) = \dfrac{\sum_{r=1}^{4} N_{gsr} W_r}{\sum_{r=1}^{4} N_{gsr} w_r}$,

where $W_r$ is the rth element in the vector of weights applied to $N_{gsr}$ to
obtain the numerator of $P_{gs}(W_r)$, while $w_r$ is the rth element in the vector of
weights applied to $N_{gsr}$ to obtain the denominator of $P_{gs}(W_r)$. For equations
(1) to (3) above, the corresponding weight vectors, $W_r$ and $w_r$, are:

                  Numerator weights, W_r      Denominator weights, w_r
  Equation           R,  W,  O,  NR               R,  W,  O,  NR

    (1)            (1,  0,  0,  0)              (1,  1,  1,  1)
    (2)            (1,  0,  0,  0)              (1,  1,  1,  0)
    (3)            (1, -1/(k-1),  0,  0)        (1,  1,  1,  1)

Here R, W, O and NR denote right, wrong, omit and not reached, respectively.




Choice of $W_r$ and $w_r$ for use in (4) determines the data $P_{gs}(W_r)$ to be modelled.
In the example in Table 1, simple percent correct, equation (1), was used to
obtain the data to be modelled. Dividing the numbers in the third and seventh
rows by the numbers in the first and fifth rows, respectively, provides the
simple percent corrects contained in the fourth and eighth rows, respectively,
of Table 1. For example, the .5 ($P_{A3}$) for Group A at score level L3 is
obtained by dividing 12,500 ($N_{A31}$) by 25,000 ($N_{A3+}$).
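The general formula (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the response counts, and the choice of k = 5 options are hypothetical and are not taken from the report.

```python
def percent_correct(counts, numerator_weights, denominator_weights):
    """Equation (4): weighted sum of counts over weighted sum of counts.

    counts holds the response counts at one score level in one subgroup,
    ordered as (right, wrong, omit, not reached).
    """
    numerator = sum(n * w for n, w in zip(counts, numerator_weights))
    denominator = sum(n * w for n, w in zip(counts, denominator_weights))
    return numerator / denominator

# Hypothetical counts: 60 right, 20 wrong, 15 omit, 5 not reached.
counts = [60, 20, 15, 5]
k = 5  # hypothetical number of options in the multiple choice question

# Weight vectors for equations (1) to (3), as tabulated in the text.
simple = percent_correct(counts, [1, 0, 0, 0], [1, 1, 1, 1])
not_reached_corrected = percent_correct(counts, [1, 0, 0, 0], [1, 1, 1, 0])
guessing_adjusted = percent_correct(counts, [1, -1 / (k - 1), 0, 0], [1, 1, 1, 1])

print(simple, not_reached_corrected, guessing_adjusted)
```

With these hypothetical counts, the simple percent correct is 60/100, the not-reached correction gives 60/95, and the guessing adjustment gives (60 - 20/4)/100.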

The Model for the Data 

The data are defined as the percent correct for the study group. For an
empirical model, the model for the data is simply the same percent correct
for the base group. Both the data and the empirical model for the data
are obtained via equation (4). For the data, the subscript g refers to the
study group. Likewise, for the model, the subscript g refers to the base group.

When the data base is sufficiently large, as is the case with the SAT, it
is often sensible to use the largest subgroup as the base group. In that case
the model for the data can be obtained via a straightforward application of
equation (4). In the hypothetical example depicted in Table 1, the base group
model values for simple percent correct data are simply the observed percent
correct data for Group A, which are listed in the fourth row.

Definition of the Standardization Group 

The standardization group supplies the standard ability distributions used
by the standardization approach. Any of the G subgroups can be used as the
standardization group. Since the standard ability distribution serves as a
weighting function, it is advisable to use each study group as its own
standardization group, thereby using a weighting function that mirrors the relative
frequency at each score level in the study group.

Formalizing the role of the standard ability distribution in the
standardization process illustrates how it serves as a weighting function. As the phrase
might imply, "unexpected differential item performance" focuses on unexpected
differences in item performance. Controlling for differences in subgroup
ability through standardization enables us to label as unexpected any difference
between actual and expected item performance. For subgroups composed of equally
able members, there should be no differences in item performance. For the
SAT and TSWE, reported scaled scores are highly reliable measures of the
developed abilities assessed by that testing instrument. It is therefore reasonable
to presume that individuals at the same scaled score ability level across
subgroups should have the same probability of successful performance on the
item. Hence unexpected differential item performance focuses on differences in
item performance at fixed score levels. For SAT-V and SAT-M, there are 61
reported score levels, and for TSWE, there are 41 reported score levels.
Standardization affords us a simple way of summarizing unexpected
differences in item performance across score levels. For both SAT-V and SAT-M,
it enables us to reduce 61 potential differences to two summary indices without
the confounding effects due to differences in group ability. For TSWE, 41
potential differences are reduced to two summary indices.

Statistical Indices of Unexpected Differential Item Performance 
At each score level s, in group g, we have the difference,

(5)  $D_{gs} = P_{gs} - \hat{P}_{gs}$,

where $P_{gs}$ is the observed data defined in (4) using the study group's counts, $N_{gsr}$,
and $\hat{P}_{gs}$ is the model for the data defined via (4) using the base group's counts.
In equation (5), $D_{gs}$ is a conditional difference between the data and the model.

Let $W_{gs}$ be the standardization group weighting function for study group g. A
sensible weighting function contains the relative frequencies of scaled score s
in study group g, i.e.,

(6)  $W_{gs} = N_{gs+} / N_{g++}$,

where $N_{gs+}$ is the number of individuals in group g at score level s and $N_{g++}$ is
the number of individuals in group g across all S score levels.

Applying each $W_{gs}$ to its corresponding conditional difference and summing
across score levels yields a mean weighted difference,

(7)  $D_g = \sum_{s=1}^{S} W_{gs} D_{gs}$,

an overall difference between the data and the model for percent correct.
This difference is one index of unexpected differential item performance
supplied by standardization with respect to ability. A second index is the
mean weighted squared difference,

(8)  $MWSD_g = \sum_{s=1}^{S} W_{gs} D_{gs}^{2}$,

which can be rewritten as

(9)  $MWSD_g = \sum_{s=1}^{S} (W_{gs} D_{gs}) D_{gs}$,

which implies that each difference is weighted by itself as well as by the
weighting function associated with the standardization group. The square root
of MWSD is also an index of discrepancy, RMWSD, that is on a scale that is
comparable to $D_g$.

To illustrate the standardization process, let us return to the data in
Table 1. Suppose Group B were the study group, chosen as such because its lower
ability level led critics of testing to believe that test items were biased
against Group B. Since there are 100,000 individuals in Group A, it was chosen
as the base group. Since we are primarily interested in study group B, its
ability distribution supplies us with a natural weighting function. Hence, the
data, model and weighting function are:



         $P_{Bs}$               $\hat{P}_{Bs} = P_{As}$            $W_{Bs}$

L1:    40/200  = .20         500/5000   = .10        200/1000 = .20
L2:   140/350  = .40        4500/15000  = .30        350/1000 = .35
L3:   150/250  = .60       12500/25000  = .50        250/1000 = .25
L4:   120/150  = .80       24500/35000  = .70        150/1000 = .15
L5:    50/50   = 1.0       18000/20000  = .90         50/1000 = .05

                                                     $\Sigma$ = 1.0



Note that, as with all weighting functions, $\sum_s W_{Bs} = 1.0$. Using the information
above, we obtain






        $P_{Bs}$   $\hat{P}_{Bs} = P_{As}$   $D_{Bs}$   $W_{Bs}$   $W_{Bs} D_{Bs}$   $W_{Bs} D_{Bs}^{2}$

L1:      .2        .1          .1      .20      .020       .0020
L2:      .4        .3          .1      .35      .035       .0035
L3:      .6        .5          .1      .25      .025       .0025
L4:      .8        .7          .1      .15      .015       .0015
L5:     1.0        .9          .1      .05      .005       .0005

$\Sigma$                                 1.0      .1         .01

The summation row above reveals that $D_B = .1$ and $MWSD_B = .01$ when Group A is the
base group. Note that $D_B^{2} = MWSD_B$, which indicates that the entire sum of squared
differences is due to the constant difference of .1 observed at each score
level.
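The computation of the indices in equations (5) through (9) for the Table 1 data can be sketched as follows. This is a minimal illustration, not the authors' program: the function name and argument names are assumptions, and the sketch assumes every score level has a nonzero count in both groups. The last returned value uses the relationship stated in the footnote to Figure 14, in which RMWSD is the distance from the origin to a point whose coordinates are the mean weighted difference and the standard deviation of the weighted differences.

```python
import math

def standardization_indices(study_n, study_correct, base_n, base_correct):
    """Return (D, MWSD, RMWSD, SD) with the study group as its own
    standardization group, using simple percent correct as the data."""
    total = sum(study_n)
    d = 0.0
    mwsd = 0.0
    for n_s, c_s, n_b, c_b in zip(study_n, study_correct, base_n, base_correct):
        weight = n_s / total             # equation (6)
        diff = c_s / n_s - c_b / n_b     # equation (5): data minus model
        d += weight * diff               # equation (7)
        mwsd += weight * diff ** 2       # equation (8)
    rmwsd = math.sqrt(mwsd)
    # Spread of the differences about their weighted mean; max() guards
    # against a tiny negative value produced by floating-point rounding.
    sd = math.sqrt(max(0.0, mwsd - d ** 2))
    return d, mwsd, rmwsd, sd

# Table 1 counts: Group B is the study group, Group A the base group.
n_a = [5000, 15000, 25000, 35000, 20000]
correct_a = [500, 4500, 12500, 24500, 18000]
n_b = [200, 350, 250, 150, 50]
correct_b = [40, 140, 150, 120, 50]

d_b, mwsd_b, rmwsd_b, sd_b = standardization_indices(n_b, correct_b, n_a, correct_a)
print(f"D = {d_b:.3f}, MWSD = {mwsd_b:.4f}, RMWSD = {rmwsd_b:.3f}, SD = {sd_b:.3f}")
```

Because the difference is a constant .1 at every level, the standard deviation term is essentially zero and RMWSD coincides with the mean weighted difference.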

Contrasting Standardization With Other Approaches

The assessment of unexpected differential item performance is an important
concern in applied testing. As such it has attracted much attention, e.g.,
Berk's (1982) Handbook of Methods for Detecting Test Bias. From the title of
Berk's volume one might infer that several methods for bias detection exist, and
the contents of the volume confirm this inference. The intent of this closing
section is to place the standardization approach within the context of the
methods included in the Berk volume.

Scheuneman (1981) makes a distinction between two general types of item
bias definitions: definitions related to an item-by-group interaction, e.g.,
Angoff's (Angoff and Ford, 1973) transformed item difficulty approach, and
definitions that involve conditioning on ability, e.g., item response theory
approaches (Lord, 1980). Unexpected differential item performance is clearly a
definition involving conditioning on ability. The standardization approach to
assessing unexpected differential item performance is most akin to item response
theory methods.

In item response theory approaches, parameterized item-ability regressions, 
or item response functions, for different subgroups are computed and compared. 
In the standardization approach, unparameterized item-test regressions are 
compared. While the parametric item response methods are more
elegant, the particular model (e.g., one-parameter) may not fit the data and
the lack of fit might be misconstrued as bias. In contrast, unparameterized 
item-test regressions will not suffer from model fit problems. Like any method 
that uses an internal criterion, however, unparameterized item-test regressions 
are subject to bothersome item-total contaminations. 

While the standardization approach is more akin to parametric item response 
theory methods, it shares some of the simplicity of the transformed item diffi- 
culty or delta-plot method. It too results in "transformed" item difficulties, 
namely the predicted p-values obtained from applying the marginal ability 
distribution of the standardization group to the base group conditional item 
success curves. These predicted p-values are the item difficulties one would 
expect if both the base group and the study group had ability distributions like 
that of the standardization group. These predicted difficulties should be 
identical because ability has been directly controlled for through standardi- 
zation. Any substantial deviation from identity could be construed as evidence 
of unexpected differential item performance, evidence stated in the simple 
metric of proportion answering an item correctly. 
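The predicted p-values described above can be illustrated with the Table 1 data. The sketch below is an assumption-laden illustration rather than the report's procedure as programmed: the function name is hypothetical, Group B is taken as both the study group and the standardization group, and Group A is taken as the base group.

```python
def standardized_p_value(conditional_p, standard_weights):
    """Apply a standard ability distribution to a conditional item success curve."""
    return sum(w * p for w, p in zip(standard_weights, conditional_p))

# Conditional proportions correct at levels L1 to L5, from Table 1.
cond_a = [0.1, 0.3, 0.5, 0.7, 0.9]   # base group (Group A)
cond_b = [0.2, 0.4, 0.6, 0.8, 1.0]   # study group (Group B)

# Standardization group weights: Group B's relative frequencies by level.
w_b = [0.20, 0.35, 0.25, 0.15, 0.05]

predicted_base = standardized_p_value(cond_a, w_b)
predicted_study = standardized_p_value(cond_b, w_b)

print(f"base group standardized p-value:  {predicted_base:.2f}")
print(f"study group standardized p-value: {predicted_study:.2f}")
print(f"difference (study - base):        {predicted_study - predicted_base:.2f}")
```

Under these assumptions the two standardized p-values differ by the same .1 obtained for the mean weighted difference in the worked example, since both computations weight the conditional differences by Group B's ability distribution.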




References

Angoff, W. H., and Ford, S. F. Item-race interaction on a test of scholastic
aptitude. Journal of Educational Measurement, 1973, 10, 95-106.

Berk, R. A. (Ed.) Handbook of methods for detecting test bias. Baltimore, MD:
Johns Hopkins University Press, 1982.

Lord, F. M. Applications of item response theory to practical testing problems.
Hillsdale, NJ: Erlbaum, 1980.

Scheuneman, J. D. A new look at bias in aptitude tests. In P. Merrifield (Ed.),
New Directions for Testing and Measurement: Measuring human abilities, No.
12. San Francisco: Jossey-Bass, 1981.

Wagner, C. H. Simpson's paradox in real life. The American Statistician, 1982,
36 (1), 46-47.






